<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The Code Sonnet Scribe]]></title><description><![CDATA[Software Engineer | Musician | Traveller]]></description><link>https://vaishnave.page</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 05:47:59 GMT</lastBuildDate><atom:link href="https://vaishnave.page/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The Engineer's Logs - #3]]></title><description><![CDATA[Every engineer has two journeys:One is the work we ship.The other is the growth we quietly build behind the scenes.

The Engineer’s Logs is my attempt to capture that second journey.
These notes and logs are rough and candid, meant for me to revisit ...]]></description><link>https://vaishnave.page/the-engineers-logs-3</link><guid isPermaLink="true">https://vaishnave.page/the-engineers-logs-3</guid><category><![CDATA[engineering]]></category><category><![CDATA[technology]]></category><category><![CDATA[Databricks]]></category><category><![CDATA[Testing]]></category><category><![CDATA[APIs]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Fri, 16 May 2025 18:30:17 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p><strong>Every engineer has two journeys:</strong><br />One is the work we ship.<br />The other is the growth we quietly build behind the scenes.</p>
</blockquote>
<p><em>The Engineer’s Logs</em> is my attempt to capture that second journey.</p>
<p><em>These notes and logs are rough and candid, meant for me to revisit over time. But I figured—why not share them here too!</em></p>
<h1 id="heading-this-weeks-tech-and-other-musings">🔎 <strong>This Week’s (Tech and other) Musings</strong></h1>
<h2 id="heading-what-i-consumed">🎧 <strong>What I Consumed</strong></h2>
<ul>
<li><p><strong>Zepto</strong> launches <em>Atom</em>, giving brands real-time insights into their product performance on the platform.</p>
</li>
<li><p><strong>Databricks</strong> ventures into the OLTP space with its acquisition of <em>Neon</em>, an AI-assisted, cloud-native, serverless Postgres engine. With Neon, AI agents can <em>spin up temporary Postgres branches</em> to run, test, and discard without human input.</p>
</li>
</ul>
<h2 id="heading-what-i-learnt">🎧 <strong>What I Learnt</strong></h2>
<ul>
<li><p>I worked on unit testing for a couple of APIs. Learnt a couple of things:</p>
<ul>
<li><p>The flow I followed for setup was:</p>
<ul>
<li><p>Added the project’s root directory to sys.path so imports work everywhere without ugly relative paths.</p>
</li>
<li><p>For isolation, every test runs against its own in-memory SQLite database.</p>
</li>
<li><p>The helper <em>fresh_db</em> (an asynccontextmanager in conftest.py) does the heavy lifting. An <strong>async context manager</strong> bundles “setup → yield resource → teardown” for asynchronous code. It produces an object you use with “async with” anywhere in your application or tests.</p>
<ul>
<li><p>before yield: creates the SQLite engine, builds all tables, then opens and returns an AsyncSession;</p>
</li>
<li><p>after the test exits: disposes the engine, wiping the database.</p>
</li>
</ul>
</li>
<li><p><em>db_session</em> is a pytest fixture that simply yields fresh_db’s session (a minimal sketch of this wiring appears after this list). A <strong>pytest fixture</strong> is a reusable, plug-and-play helper that prepares something your test needs and then cleans it up afterward. Using db_session as an argument inside your test means:</p>
<ul>
<li><p>Pytest inspects the test function’s parameter list.</p>
</li>
<li><p>It sees a parameter named db_session and looks for a fixture with that name.</p>
</li>
<li><p>It runs the fixture, which in turn runs fresh_db, creates a brand-new AsyncSession, and yields it.</p>
</li>
<li><p>The yielded session object is fed into the test function as the argument db_session.</p>
</li>
<li><p>After the test body completes (pass or fail), execution resumes after the yield inside the fixture, which closes/disposes the SQLite engine—so nothing leaks into the next test.</p>
</li>
</ul>
</li>
<li><p>Most test files also declare a small seeded_session helper. It calls fresh_db, pre-populates rows that the particular test needs, then yields the session—so every test starts with relevant data.</p>
</li>
<li><p>During startup we register a few MySQL-style SQL functions with SQLite so that any raw SQL used in the codebase runs unchanged.</p>
</li>
<li><pre><code class="lang-plaintext">      async test  ───&gt;  requests fixture  ───&gt;  fixture enters context manager
                    ^                       |
                    |                       v
                    &lt;────  uses session  &lt;─── yields session
</code></pre>
</li>
</ul>
</li>
<li><p>All input and expected data was separated out into a data file for tests. This made it easy to tweak the parameters.</p>
</li>
<li><p>To summarise:</p>
<ul>
<li><p>Pytest finds the file →</p>
</li>
<li><p>creates a clean SQLite database →</p>
</li>
<li><p>seeds a few rows →</p>
</li>
<li><p>runs the business-logic function you’re interested in →</p>
</li>
<li><p>checks the return values/query results →</p>
</li>
<li><p>tears everything down</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
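<p>Here is a minimal sketch of the conftest.py wiring described above, assuming SQLAlchemy’s async API with the aiosqlite driver and pytest-asyncio; the <code>app.models.Base</code> import is a placeholder for the project’s actual declarative base, and <code>fresh_db</code>/<code>db_session</code> mirror the helpers mentioned in the notes rather than the exact project code.</p>
<pre><code class="lang-python"># conftest.py: a rough sketch, not the exact project code.
import sys
from pathlib import Path
from contextlib import asynccontextmanager

import pytest_asyncio
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine

# 1. Put the project root on sys.path so absolute imports work everywhere.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from app.models import Base  # placeholder for the project's declarative base


@asynccontextmanager
async def fresh_db():
    # Setup: in-memory SQLite engine, create all tables, open an AsyncSession.
    engine = create_async_engine("sqlite+aiosqlite:///:memory:")
    async with engine.begin() as conn:
        await conn.run_sync(Base.metadata.create_all)
    session = AsyncSession(engine)
    try:
        yield session            # the test body runs here
    finally:
        # Teardown: close the session and dispose the engine, wiping the DB.
        await session.close()
        await engine.dispose()


@pytest_asyncio.fixture
async def db_session():
    # Any test that declares a db_session parameter gets a fresh session.
    async with fresh_db() as session:
        yield session
</code></pre>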
]]></content:encoded></item><item><title><![CDATA[The Engineer's Logs - #2]]></title><description><![CDATA[Every engineer has two journeys:One is the work we ship.The other is the growth we quietly build behind the scenes.

The Engineer’s Logs is my attempt to capture that second journey.
These notes and logs are rough and candid, meant for me to revisit ...]]></description><link>https://vaishnave.page/the-engineers-logs-2</link><guid isPermaLink="true">https://vaishnave.page/the-engineers-logs-2</guid><category><![CDATA[technology]]></category><category><![CDATA[Google]]></category><category><![CDATA[Apple]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Sun, 11 May 2025 17:54:52 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p><strong>Every engineer has two journeys:</strong><br />One is the work we ship.<br />The other is the growth we quietly build behind the scenes.</p>
</blockquote>
<p><em>The Engineer’s Logs</em> is my attempt to capture that second journey.</p>
<p><em>These notes and logs are rough and candid, meant for me to revisit over time. But I figured—why not share them here too!</em></p>
<h1 id="heading-this-weeks-tech-and-other-musings">🔎 <strong>This Week’s (Tech and other) Musings</strong></h1>
<h2 id="heading-what-i-consumed">🎧 <strong>What I Consumed</strong></h2>
<ul>
<li><p>I love <a target="_blank" href="https://podcasts.apple.com/us/podcast/journalism-comes-in-second-government-threatens-wikipedias/id73329404?i=1000706317419">This Week In Tech</a>—this podcast roasts everybody, stays refreshingly unbiased, and is probably the only show that can hold my attention.</p>
<ul>
<li><p><em>Law…huh…yeah, What is it good for?</em> — Apple’s take on antitrust rulings: held in contempt for ignoring the 2021 injunction…because apparently, laws are optional. Cue Tim Cook’s hair flip.</p>
</li>
<li><p>Visa announces a plan to give agents your credit card info: agents that learn your tastes can charge your credit card for things you’re likely to enjoy—with almost zero human oversight. It’s unsettling—picture those agents’ software updates tied to your buying power, upgrading themselves. RIP, paycheck.</p>
</li>
<li><p>Google’s got its own dilemma—imagine Sundar on vacation in India, riding in an auto rickshaw, weighing whether to ditch Chrome or surrender his search data. Spoiler: how about no.</p>
</li>
</ul>
</li>
<li><p>Meanwhile, during the <a target="_blank" href="https://www.youtube.com/watch?v=VJniL3C-Tac">Google Cloud Summit 2025, India</a>:</p>
<ul>
<li><p><strong>Gemini 2.5 Enhancements</strong></p>
<ul>
<li><p>Significantly upgraded reasoning and contextual understanding compared to prior Gemini versions</p>
</li>
<li><p>True multi-modal operation: it can ingest and output text, images, and video in a single pipeline, streamlining content-driven business processes</p>
</li>
</ul>
</li>
<li><p><strong>Global Infrastructure Build-out</strong></p>
<ul>
<li><p>Deployment of new subsea fiber cables “Umoja” and “Bosoon” to expand international bandwidth and reduce latency</p>
</li>
<li><p>These cables underpin Google’s cloud backbone, boosting throughput for the 2 billion+ AI models and datasets hosted in Workspace</p>
</li>
</ul>
</li>
<li><p><strong>Healthcare AI at Manipal Hospital</strong></p>
<ul>
<li><p>AI-driven monitoring during nurse handovers minimizes errors and accelerates patient data updates</p>
</li>
<li><p>Automated alerts highlight critical vitals changes, improving response times.</p>
</li>
</ul>
</li>
<li><p><strong>Enterprise AI Adoption by Wipro</strong></p>
<ul>
<li><p>Integration of Vertex AI pipelines with Gemini for code generation, model tuning, and automated testing</p>
</li>
<li><p>Migrated large-scale workloads to Google Cloud 30% faster by leveraging <strong>AI-powered migration assistants</strong></p>
</li>
</ul>
</li>
<li><p><strong>AI First Accelerator Program</strong></p>
<ul>
<li><p>Joint initiative with MeitY Startup Hub targeting 10,000 Indian startups over the next 3 years</p>
</li>
<li><p>Focus areas include healthcare AI, retail analytics, and sustainable-tech solutions</p>
</li>
</ul>
</li>
<li><p><strong>Agent Development Kit (ADK)</strong></p>
<ul>
<li><p>Framework for building autonomous “agents” that can plan, reason, and execute tasks based on <strong>live data feeds.</strong> <em>Damn.</em></p>
</li>
<li><p>Includes SDKs for Python and Java, pre-built connectors to BigQuery, Pub/Sub, and third-party APIs</p>
</li>
<li><p>Simplifies integration into existing microservices architectures</p>
</li>
</ul>
</li>
<li><p><strong>Cloud Assist &amp; Code Assist Tools</strong></p>
<ul>
<li><p>Embedded within popular IDEs to surface context-aware code snippets, bug fixes, and documentation links</p>
</li>
<li><p>Automates repetitive refactoring tasks—up to 40% time savings on boilerplate code</p>
</li>
<li><p>Improves code quality by suggesting security patches and performance optimizations in real time.</p>
</li>
<li><p><em>I used to type out scenarios and request boilerplate code from ChatGPT, but with tools like Cursor, Copilot, and Lovable, development time has been cut down drastically.</em></p>
</li>
</ul>
</li>
<li><p><strong>Emergence of Agentive Applications</strong></p>
<ul>
<li><p>Next-generation apps that proactively manage workflows—e.g., booking meetings, collating reports, or handling customer queries without direct prompts</p>
</li>
<li><p>Agents can inter-communicate, forming dynamic “<strong>agent networks</strong>” to solve complex, multi-step problems</p>
</li>
<li><p><em>I can imagine a bunch of agents sitting and gossiping about the fact that I missed yet another meeting that they set up. Ugh.</em></p>
</li>
</ul>
</li>
<li><p><strong>Vision for the Future</strong></p>
<ul>
<li>Google positions these advances as building blocks for an AI-driven economy: by combining powerful models, robust infrastructure, and human expertise, the goal is to unlock new levels of efficiency and innovation across every industry.</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>… <em>Google’s rolling out flashy AI bells and whistles just to remind everyone they’re still in the game, while OpenAI dominates the space.</em></p>
<p>Sorry Google, no consolation prizes in the AI world.</p>
<h2 id="heading-what-i-learnt">🎧 <strong>What I Learnt</strong></h2>
<p>🟡 <strong>AWS Learning Snapshot</strong><br />I worked on getting an understanding of the AWS ecosystem, console navigation, key benefits, and core services like S3. I also covered the AWS Well-Architected Framework and its pillars, learned how AWS helps secure systems, and explored cloud migration strategies using AWS CAF and services like Snowball.</p>
<p>🔜 <strong>Up Next:</strong><br />I’ll be diving into <strong>Cloud Economics and Global Infrastructure</strong>—understanding AWS cost optimization, right-sizing, connectivity options, and how AWS’s global architecture works. After that, I’ll get hands-on with the <strong>Core AWS Services</strong> including Compute, Database, and Storage, plus topics like serverless vs server-based solutions and creating S3 buckets.</p>
<p>I want to run through these super quickly so I can spend some time on AWS Lambda functions, API Gateway, and SNS.</p>
]]></content:encoded></item><item><title><![CDATA[The Engineer’s Logs - #1]]></title><description><![CDATA[Every engineer has two journeys:One is the work we ship.The other is the growth we quietly build behind the scenes.

The Engineer’s Logs is my attempt to capture that second journey.
These notes and logs are rough and candid, meant for me to revisit ...]]></description><link>https://vaishnave.page/the-engineers-logs-1</link><guid isPermaLink="true">https://vaishnave.page/the-engineers-logs-1</guid><category><![CDATA[software development]]></category><category><![CDATA[learning]]></category><category><![CDATA[technology]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Thu, 01 May 2025 08:59:24 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p><strong>Every engineer has two journeys:</strong><br />One is the work we ship.<br />The other is the growth we quietly build behind the scenes.</p>
</blockquote>
<p><em>The Engineer’s Logs</em> is my attempt to capture that second journey.</p>
<p><em>These notes and logs are rough and candid, meant for me to revisit over time. But I figured—why not share them here too!</em></p>
<h1 id="heading-this-weeks-tech-and-other-musings">🔎 <strong>This Week’s (Tech and other) Musings</strong></h1>
<h2 id="heading-what-i-consumed">🎧 <strong>What I Consumed</strong></h2>
<ul>
<li><p><a target="_blank" href="https://podcasts.apple.com/us/podcast/never-lick-a-badger-twice-selling-chrome-modular/id73329404?i=1000705217980">This Week In Tech</a> brought up something interesting about Mark: is Zuckerberg bitter about Instagram’s success over Facebook, his own product?</p>
</li>
<li><p>Is Google going to lose Chrome… to OpenAI??</p>
</li>
<li><p>Do we <em>really</em> need a foldable iPhone? My battery still hasn’t forgiven Apple for killing the audio jack.</p>
</li>
<li><p><a target="_blank" href="https://www.cnbctv18.com/technology/sam-altman-backed-startup-rolls-out-eyeball-scanning-tech-across-us-19597393.htm">Eyeball scanning tech</a> - are we really ready for this from a privacy standpoint? Not exactly fun to risk my biometric data ending up on the Dark Web.</p>
</li>
<li><p><a target="_blank" href="https://readwise.io/reader/shared/01jsxbn3rntd1wkvkaja7v2dgs">Ben Kuhn from Anthropic</a> talks about intentional high <strong>Agency</strong> and good <strong>Taste</strong>, and how these two superpowers combined produce a big impact-per-hour multiplier.</p>
</li>
<li><p>Jean Lee, WhatsApp’s 19th engineer, talks about likeability and credibility being two underrated assets that can go a long way together. She also talks about finding fulfilment at work. Kinda boils down to understanding <strong>why</strong> you do what you do every day.</p>
</li>
<li><p>Bye bye, GPT-4. You had a good run.</p>
</li>
</ul>
<h2 id="heading-what-i-learnt">🎧 <strong>What I Learnt</strong></h2>
<ul>
<li><p>Started understanding more about AWS cloud (coming from an Azure background).</p>
</li>
<li><p>Started with <strong>Alex Chiou’s</strong> talks on maximising your productivity as a software engineer on Taro.</p>
<ul>
<li><p>Learnt a lot of interesting things that I can practically use in my day to day work planning and prioritisation. One of the things he introduced is the <strong>Impact Time matrix</strong> where you assign the highest points to the work with the highest impact and the lowest time.</p>
</li>
<li><p>The other thing he mentioned is surrounding yourself with people that you do not want to let down. Essentially, he highlights the <strong>effectiveness of social pressure over work pressure</strong>. I am a huge believer in this and want to surround myself with good people that I enjoy working with. He also brought up going into office and virtual co-working as being effective methods of daily motivation and accountability.</p>
</li>
</ul>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Sparking Solutions]]></title><description><![CDATA[Introduction to Spark Optimization
Optimizing Spark can dramatically improve performance and reduce resource consumption. Typically, optimization in Spark can be approached from three distinct levels: cluster level, code level, and CPU/memory level. ...]]></description><link>https://vaishnave.page/sparking-solutions</link><guid isPermaLink="true">https://vaishnave.page/sparking-solutions</guid><category><![CDATA[spark]]></category><category><![CDATA[spark optimizations]]></category><category><![CDATA[optimization]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[PySpark]]></category><category><![CDATA[performance]]></category><category><![CDATA[Performance Optimization]]></category><category><![CDATA[caching]]></category><category><![CDATA[joins]]></category><category><![CDATA[apache]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Sun, 29 Sep 2024 11:49:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1727607602226/94c32da1-7e15-49e4-8222-f0df557ea12b.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction-to-spark-optimization">Introduction to Spark Optimization</h1>
<p>Optimizing Spark can dramatically improve performance and reduce resource consumption. Typically, optimization in Spark can be approached from three distinct levels: cluster level, code level, and CPU/memory level. Each level addresses different aspects of Spark's operation and contributes uniquely to overall efficiency in distributed processing with Spark.</p>
<h4 id="heading-cluster-level-optimization">Cluster Level Optimization</h4>
<p>At the cluster level, optimization involves configuring the Spark environment to efficiently manage and utilize the hardware resources available across the cluster. This includes <strong>tuning resource allocation settings such as executor memory, core counts, and maximizing data locality to reduce network overhead</strong>. Effective cluster management ensures that Spark jobs are allocated the right amount of resources, <strong>balancing between under-utilization and over-subscription</strong>.</p>
<h4 id="heading-code-level-optimization">Code Level Optimization</h4>
<p>Code level optimization refers to the improvement of the actual Spark code. It involves refining the data processing operations to make better use of Spark’s execution engine. This includes optimizing the way data transformations and actions are structured, minimizing shuffling of data across the cluster, and employing Spark's <strong>Catalyst optimizer</strong> to enhance SQL query performance. Writing efficient Spark code not only speeds up processing times but also reduces the computational overhead on the cluster.</p>
<h4 id="heading-cpumemory-level-optimization">CPU/Memory Level Optimization</h4>
<p>At the CPU/memory level, the focus shifts to the internals of how Spark jobs are executed on each node. This entails tuning garbage collection processes, managing serialization and deserialization of data, and optimizing memory usage patterns to prevent spilling and excessive garbage collection. These optimizations, utilizing the <strong>Tungsten optimizer</strong>, enhance the utilization of available CPU and memory resources, which is essential for efficiently processing large-scale data.</p>
<p>In this article, we will mostly be diving into code-level optimization, building on the code we worked with in the previous articles.</p>
<h1 id="heading-narrow-and-wide-transformations">Narrow and Wide Transformations</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727603645345/8c51643c-6bfb-4d80-93ce-ff17d08ae1d1.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-narrow-transformations">Narrow Transformations</h2>
<p>Narrow transformations involve operations where data required to compute the elements of the resulting RDD can be derived from a single partition of the parent RDD. This means <strong>there is no need for data to be shuffled across partitions or across nodes in the cluster</strong>. Each partition of the parent RDD contributes to only one partition of the child RDD.</p>
<p><strong>Examples of Narrow Transformations:</strong></p>
<ul>
<li><p><code>map()</code>: Applies a function to each element in the RDD.</p>
</li>
<li><p><code>filter()</code>: Returns a new RDD containing only the elements that satisfy a predicate.</p>
</li>
<li><p><code>flatMap()</code>: Similar to map, but each input item can be mapped to zero or more output items.</p>
</li>
</ul>
<p>Because data does not need to move between partitions, narrow transformations are generally more efficient and faster.</p>
<h2 id="heading-wide-transformations">Wide Transformations</h2>
<p>Wide transformations, also known as shuffle transformations, involve operations where <strong>data needs to be shuffled across partitions or even across different nodes</strong>. This happens because the data required to compute the elements in a single partition of the resulting RDD may reside in many partitions of the parent RDD.</p>
<p><strong>Examples of Wide Transformations:</strong></p>
<ul>
<li><p><code>groupBy()</code>: Groups data according to a specified key.</p>
</li>
<li><p><code>reduceByKey()</code>: Merges the values for each key using an associative reduce function.</p>
</li>
<li><p><code>join()</code>: Joins different RDDs based on their keys.</p>
</li>
</ul>
<p><strong>Wide transformations are costlier than narrow transformations</strong> because they involve shuffling data, which is an expensive operation in terms of network and disk I/O.</p>
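<p>A hedged PySpark illustration of the difference (the column names and data sizes here are made up):</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# Narrow: filter() and withColumn() work partition-by-partition, no shuffle.
narrow = df.filter(F.col("id") % 2 == 0).withColumn("doubled", F.col("id") * 2)

# Wide: groupBy() must bring all rows with the same key together, so it shuffles.
wide = narrow.groupBy("bucket").count()

wide.show()
</code></pre>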
<h1 id="heading-spark-computation">Spark Computation</h1>
<p>When you write and run Spark code, how does it execute from start to finish to produce the output?</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727603337873/72c17b48-c2b1-4b31-995e-a9c6282d6b07.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>Number of Jobs</strong>: In Spark, a job is triggered by an action. Each action triggers the creation of a job, which orchestrates the execution of stages needed to return the results of the action.</p>
<p>  <code>Number of jobs = Number of actions</code></p>
</li>
<li><p><strong>Number of Stages</strong>: Stages in Spark are sets of tasks that are executed together on the same cluster node and can run in parallel. Stages are divided by wide transformations.</p>
<p>  <code>Number of stages = Number of wide transformations + 1</code></p>
</li>
<li><p>This +1 accounts for the final stage that gathers results following any necessary shuffles.</p>
</li>
<li><p><strong>Number of Tasks</strong>: Tasks are the smallest units of work in Spark and correspond to individual operations that can be performed on data partitions in parallel. The number of tasks is directly proportional to the number of partitions of the RDD (Resilient Distributed Dataset). A single task is spawned for each partition for every stage. Therefore, <em>adjusting the number of partitions can significantly affect parallelism and performance</em>.</p>
<p>  <code>Number of tasks = Number of partitions</code></p>
</li>
</ul>
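<p>As a hedged sanity check of these formulas, consider a small pipeline (default configuration assumed):</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 100, numPartitions=4)

result = (
    df.filter("id % 2 = 0")                  # narrow transformation
      .groupBy((df.id % 10).alias("k"))      # wide transformation: stage boundary
      .count()
)

result.collect()   # 1 action, so 1 job
# Stages: 1 wide transformation + 1 = 2 stages.
# Tasks:  the first stage runs 4 tasks (one per input partition); the stage
#         after the shuffle runs spark.sql.shuffle.partitions tasks (200 by default).
</code></pre>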
<h1 id="heading-catalyst-optimiser-the-internals">Catalyst Optimiser - The Internals</h1>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1727604976692/dd9930f7-0087-490d-8050-902c53d95a02.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-unresolved-logical-plan">Unresolved Logical Plan</h2>
<p>This is the initial plan generated directly from the user's query, representing the operations required without detailed knowledge of the actual data types and structure. If there are syntax-level issues such as missing brackets, incorrect function names, or syntactical errors, this plan will generate errors because it can't correctly interpret the query.</p>
<h2 id="heading-analyzed-logical-plan">Analyzed Logical Plan</h2>
<p>After the initial parsing, the query's logical plan is analyzed. This step involves resolving column and table names against the existing schema catalog. It checks whether the columns, tables, and functions referenced in the query actually exist and are accessible. If there are errors in the names or if referenced tables or columns don't exist, the query will fail at this stage.</p>
<h2 id="heading-optimized-logical-plan-rule-based-optimization">Optimized Logical Plan - Rule-Based Optimization</h2>
<p>Once the logical plan is analyzed and validated, it undergoes rule-based optimizations. These optimizations aim to simplify the query and improve execution efficiency:</p>
<ul>
<li><h3 id="heading-predicate-pushdown"><strong>Predicate Pushdown</strong></h3>
<p>  Filters are applied as early as possible to reduce the amount of data processed in subsequent steps.</p>
</li>
<li><h3 id="heading-projection-pushdown"><strong>Projection Pushdown</strong></h3>
<p>  Only necessary columns are retrieved, skipping over unneeded data, which minimizes I/O.</p>
</li>
<li><h3 id="heading-partition-pruning"><strong>Partition Pruning</strong></h3>
<p>  Reduces the number of partitions to scan by excluding those that do not meet the query's filter criteria.</p>
</li>
</ul>
<h2 id="heading-physical-plan">Physical Plan</h2>
<p>The optimized logical plan is then converted into one or more physical plans. These plans outline specific strategies for executing the query, including how joins, sorts, and aggregations are to be physically implemented on the cluster. The Catalyst Optimizer evaluates various strategies to find the most efficient one based on the actual data layout and distribution.</p>
<h2 id="heading-cost-model-evaluation">Cost Model Evaluation</h2>
<p>From the possible physical plans, the Catalyst Optimizer uses a cost model to determine the best execution strategy. This model considers:</p>
<ul>
<li><p>Data distribution</p>
</li>
<li><p>Disk I/O</p>
</li>
<li><p>Network latency</p>
</li>
<li><p>CPU usage</p>
</li>
<li><p>Memory consumption</p>
</li>
</ul>
<p>The optimizer assesses each plan's cost based on these factors and selects the one with the lowest estimated cost for execution.</p>
<h2 id="heading-code-generation">Code Generation</h2>
<p>This final step involves generating optimized JVM bytecode that Spark's engine can execute directly. The generated code is tailored to the selected physical plan and the underlying execution environment, so the operations on the resulting RDDs run as compact compiled routines rather than generic interpreted calls.</p>
<h2 id="heading-execution-of-optimized-rdds">Execution of Optimized RDDs</h2>
<p>The generated code leads to the execution of the plan on Spark's distributed infrastructure, creating optimized RDDs that carry out the intended data transformations and actions as defined by the query.</p>
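<p>You can watch Catalyst produce these plans for your own queries with <code>explain()</code>; a small sketch:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000)
query = df.filter("id % 2 = 0").selectExpr("id * 10 AS scaled")

# mode="extended" prints the parsed (unresolved), analyzed and optimized logical
# plans as well as the chosen physical plan.
query.explain(mode="extended")
</code></pre>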
<h1 id="heading-caching">Caching</h1>
<p><strong>Cache</strong> and <strong>persist</strong> are mechanisms for optimizing data processing by storing intermediate results that can be reused in subsequent stages of data transformations. Here's a concise explanation of both:</p>
<h2 id="heading-cache">Cache</h2>
<ul>
<li><p><code>cache()</code> in Spark stores the RDD, DataFrame, or Dataset in memory (default storage level), making it quickly accessible, speeding up future actions that reuse this data.</p>
</li>
<li><p>The data is stored as deserialized objects in the JVM memory. This means the objects are ready for computation but occupy more space compared to serialized data.</p>
</li>
</ul>
<h2 id="heading-persist">Persist</h2>
<ul>
<li><p><code>persist()</code> method allows you to specify different storage levels, offering more flexibility compared to <code>cache()</code>.</p>
</li>
<li><p>You can choose how Spark stores the data, whether in memory only, disk only, memory and disk, and more. The choice affects the performance and efficiency of data retrieval and processing.</p>
<ul>
<li><h3 id="heading-memoryonly"><strong>MEMORY_ONLY</strong></h3>
<p>  Default for <code>persist()</code>, similar to <code>cache()</code>, stores data as deserialized objects in memory.</p>
</li>
<li><h3 id="heading-diskonly"><strong>DISK_ONLY</strong></h3>
<p>  Useful for very large datasets that do not fit into memory. It avoids the overhead of recomputing the data by storing it directly on the disk.</p>
</li>
<li><h3 id="heading-memoryanddisk"><strong>MEMORY_AND_DISK</strong></h3>
<p>  Stores partitions in memory, but if the memory is insufficient, it spills them to disk. This can help handle larger datasets than memory can accommodate without re-computation.</p>
</li>
<li><h3 id="heading-offheap"><strong>OFF_HEAP</strong></h3>
<p>  Available only in Java and Scala, this option stores data outside of the JVM heap using serialized objects. This is useful for managing memory more efficiently, especially with large heaps. We’ll discuss this further when looking at the tungsten optimizer in the future articles in this series.</p>
</li>
</ul>
</li>
<li><p><strong>PySpark Specifics</strong>: In PySpark, data is always stored as serialized objects even when using <code>MEMORY_ONLY</code>. This helps reduce the JVM overhead but might increase the computation cost because objects need to be deserialized before processing.</p>
</li>
</ul>
<p>By using caching or persistence strategically in Spark applications, you can significantly reduce the time spent on recomputation of intermediate results, especially for iterative algorithms or complex transformations. This leads to performance improvements in processing workflows where the same dataset is accessed multiple times.</p>
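<p>A short, hedged sketch of how this looks in PySpark (the path and the filter column are illustrative):</p>
<pre><code class="lang-python">from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("/data/events")   # illustrative path

# cache() uses the default storage level (MEMORY_ONLY for RDDs,
# MEMORY_AND_DISK for DataFrames in recent Spark versions).
hot = events.filter("event_date = '2024-09-01'").cache()

# persist() lets you pick the storage level explicitly.
warm = events.persist(StorageLevel.MEMORY_AND_DISK)

hot.count()       # first action materializes the cache
hot.count()       # later actions reuse the cached data

hot.unpersist()   # release the memory once the data is no longer needed
warm.unpersist()
</code></pre>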
<h1 id="heading-using-spark-ui-for-optimizations">Using Spark UI for Optimizations</h1>
<p>Spark UI offers invaluable tools like lineage graphs and stage/task analysis to enhance your application's performance. Here’s how to effectively use these features:</p>
<h2 id="heading-lineage-graphs-for-debugging">Lineage Graphs for Debugging</h2>
<p>The lineage graph in Spark UI visually maps out the RDD or DataFrame transformations, providing a clear and detailed trace of the data flow from start to finish. This visual representation is crucial for understanding how your data is being transformed across various stages of your Spark application, and it plays a key role in identifying inefficiencies:</p>
<ul>
<li><h3 id="heading-identify-narrow-and-wide-transformations"><strong>Identify Narrow and Wide Transformations</strong></h3>
<p>  By examining the lineage graph, you can see whether certain stages involve wide transformations, which can cause substantial shuffling and are likely performance bottlenecks.</p>
</li>
<li><h3 id="heading-trace-data-processing-paths"><strong>Trace Data Processing Paths</strong></h3>
<p>  The lineage allows you to trace the path taken by your data through transformations. This can help in understanding unexpected results or errors in data processing.</p>
</li>
<li><h3 id="heading-optimization-opportunities"><strong>Optimization Opportunities</strong></h3>
<p>  The graph can highlight redundant or unnecessary operations. For example, multiple filters might be consolidated, or cache operations could be strategically placed to optimize data retrieval in subsequent actions.</p>
</li>
</ul>
<h2 id="heading-stage-and-task-analysis">Stage and Task Analysis</h2>
<p>The Stage and Task tabs in Spark UI provide a deeper dive into the execution of individual stages and tasks within those stages. They offer metrics such as scheduler delay, task duration, and shuffle read/write stats, which are critical for diagnosing performance issues:</p>
<ul>
<li><h3 id="heading-task-skew-identification"><strong>Task Skew Identification</strong></h3>
<p>  By analyzing task durations and shuffle read/write statistics, you can identify data skew, where certain tasks take much longer than others, suggesting an <em>imbalance in data partitioning</em>.</p>
</li>
<li><h3 id="heading-memory-allocation-metrics"><strong>Memory Allocation Metrics</strong></h3>
<p>  By keeping an eye on these metrics, you can fine-tune the memory settings in your Spark application. This helps minimize pauses caused by garbage collection—a process where Spark clears out unused data from memory—and ensures memory is used more efficiently. We will discuss more about this in the future articles.</p>
</li>
<li><h3 id="heading-performance-tuning"><strong>Performance Tuning</strong></h3>
<p>  Adjustments in your code can be made based on these insights. For example, increasing or decreasing the level of parallelism, modifying join types, or altering the way data is partitioned and distributed across the cluster.</p>
</li>
</ul>
<h2 id="heading-adjusting-code-based-on-insights">Adjusting Code Based on Insights</h2>
<p>Based on the information from the lineage graph and stage/task analysis, you can return to your code to make specific adjustments:</p>
<ul>
<li><h3 id="heading-refactor-code-for-efficient-data-processing"><strong>Refactor Code for Efficient Data Processing</strong>:</h3>
<p>  Simplify transformation chains where possible, merge operations logically to minimize the creation of intermediate data, and ensure transformations are placed optimally to reduce shuffling.</p>
</li>
<li><h3 id="heading-tune-data-shuffling-shuffle-partitions"><strong>Tune Data Shuffling</strong> (Shuffle Partitions)</h3>
<p>  Adjust configurations related to data shuffling, such as <code>spark.sql.shuffle.partitions</code>, to optimize the distribution of data across tasks and nodes.</p>
</li>
<li><h3 id="heading-leverage-caching"><strong>Leverage Caching</strong></h3>
<p>  Decide more strategically where to cache data based on the understanding of which datasets are reused extensively across different stages of the application.</p>
</li>
</ul>
<p>Here is a <a target="_blank" href="https://blog.scottlogic.com/2018/03/14/apache-spark-question-everything.html">simple example of how to understand and utilize the Spark UI</a>.</p>
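<p>As a rough sketch of acting on those insights in code (the path, column name, and partition count are illustrative):</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tune shuffle parallelism based on what the Stage/Task tabs show
# (200 is the default for spark.sql.shuffle.partitions).
spark.conf.set("spark.sql.shuffle.partitions", "64")

orders = spark.read.parquet("/data/orders")        # illustrative path

# Repartition by the join key so related rows land in the same partition,
# which can reduce the skew visible in task-duration metrics.
balanced = orders.repartition(64, "customer_id")   # illustrative column
</code></pre>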
<h1 id="heading-optimisation-cheat-sheet">Optimisation Cheat Sheet</h1>
<h2 id="heading-dos">Dos</h2>
<ol>
<li><p>Prefer a <strong>per-node computation strategy</strong> over a <strong>node-to-node communication strategy</strong>, i.e. <strong>minimize data shuffling</strong>: Design transformations to minimize the need for shuffling data across the network. <strong>Use narrow transformations instead of wide transformations wherever you can.</strong> Example: when reducing the number of partitions, prefer <code>coalesce</code> (a narrow transformation) over <code>repartition</code> (a wide transformation).</p>
<h3 id="heading-coalesce-and-repartition">Coalesce and Repartition</h3>
<p> These are used to adjust the number of partitions in an RDD or DataFrame. <code>coalesce</code> is used to decrease the number of partitions, while you can both increase and decrease the number of partitions using <code>repartition</code>.</p>
<pre><code class="lang-python"> <span class="hljs-comment"># Example DataFrame</span>
 df = spark.range(<span class="hljs-number">1000</span>)

 <span class="hljs-comment"># Using coalesce to reduce the number of partitions</span>
 df_coalesced = df.coalesce(<span class="hljs-number">5</span>)
 print(<span class="hljs-string">f"Number of partitions after coalescing: <span class="hljs-subst">{df_coalesced.rdd.getNumPartitions()}</span>"</span>)

 <span class="hljs-comment"># Using repartition to change the number of partitions</span>
 df_repartitioned = df.repartition(<span class="hljs-number">10</span>)
 print(<span class="hljs-string">f"Number of partitions after repartitioning: <span class="hljs-subst">{df_repartitioned.rdd.getNumPartitions()}</span>"</span>)
</code></pre>
</li>
<li><p><strong>Use DataFrames and Datasets</strong>: Instead of using RDDs, use DataFrames and Datasets wherever possible. These abstractions allow the Catalyst Optimizer to apply advanced optimizations like predicate pushdown, projection pushdown, and partition pruning.</p>
</li>
<li><p><strong>Filter Early</strong>: Apply filters as early as possible in your transformations. This <em>reduces the amount of data processed in later stages</em>, enhancing performance.</p>
</li>
<li><p><strong>Use Explicit Partitioning</strong>: When you know your data distribution, use explicit partitioning to optimize query performance. This helps in operations like <code>join</code> and <code>groupBy</code>, which can benefit significantly from data being partitioned appropriately.</p>
</li>
<li><p>Use the best joining strategy according to use case:</p>
<h3 id="heading-joining-strategies-recap">Joining Strategies Recap</h3>
<p> Here is a recap of the join strategies sorted by best to worst in terms of performance:</p>
<ul>
<li><p><strong>Broadcast Join (Map-Side Join)</strong>:</p>
<ul>
<li><p><strong>Best performance</strong> for smaller datasets.</p>
</li>
<li><p>Efficient when one dataset is small enough (typically less than 10MB) to be broadcast to all nodes, avoiding the need for shuffling the larger dataset.</p>
</li>
</ul>
</li>
<li><p><strong>Sort Merge Join</strong>:</p>
<ul>
<li><p>Good for larger datasets that are sorted or can be sorted efficiently.</p>
</li>
<li><p>Involves sorting both datasets and then merging, which is less resource-intensive than shuffling but more than a broadcast join.</p>
</li>
</ul>
</li>
<li><p><strong>Shuffle Hash Join</strong>:</p>
<ul>
<li><p>Suitable for medium-sized datasets.</p>
</li>
<li><p>Both datasets are partitioned and shuffled by the join key, then joined using a hash table. More expensive than sort merge due to the cost of building hash tables but less than Cartesian join.</p>
</li>
</ul>
</li>
<li><p><strong>Cartesian Join (Cross Join)</strong>:</p>
<ul>
<li><p><strong>Worst performance</strong>; should be avoided if possible.</p>
</li>
<li><p>Every row of one dataset is joined with every row of the other, leading to a massive number of combinations and significant resource consumption.</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Leverage Broadcast Variables for Small Data</strong>: When joining a small DataFrame with a large DataFrame, use broadcast variables and broadcast joins to minimize data shuffling by shipping the smaller DataFrame to all nodes (see the sketch after this list).</p>
</li>
<li><p><strong>Cache Strategically</strong>: Cache DataFrames that are used multiple times in your application. Caching can help avoid recomputation, but it should be used judiciously to avoid excessive memory consumption.</p>
</li>
<li><p><strong>Use Explain Plans and Lineage graphs in Spark UI</strong>: Regularly use the <code>EXPLAIN</code> command to understand the physical and logical plans that Catalyst generates. Additionally, using the Spark UI to examine the lineage of RDDs and DataFrames can further enhance your understanding of how Spark executes your queries, enabling more targeted optimizations.</p>
</li>
</ol>
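<p>A hedged broadcast-join sketch (the table paths and the join key are illustrative):</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet("/data/orders")          # large fact table
countries = spark.read.parquet("/data/countries")    # small dimension table

# broadcast() ships the small table to every executor, so the large table is
# joined in place without shuffling it across the cluster.
joined = orders.join(F.broadcast(countries), on="country_code", how="left")

joined.explain()   # the physical plan should show a BroadcastHashJoin
</code></pre>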
<h2 id="heading-donts">Don'ts</h2>
<ol>
<li><p><strong>Avoid Unnecessary Columns</strong>: Do not select more columns than you need. Use projection pushdown to limit the columns read from disk, especially when dealing with large datasets.</p>
</li>
<li><p><strong>Don't Neglect Data Skew</strong>: Be cautious of operations on skewed data, which can lead to inefficient resource utilization and long processing times. <a target="_blank" href="https://vaishnave.page/spark-illuminated#heading-skew-join-optimization">Try to understand and manage data skew, possibly by redistributing the data more evenly and using skew join optimisations.</a></p>
</li>
<li><p><strong>Avoid Frequent Schema Changes</strong>: Frequent changes to the data schema can lead to plan recomputation and reduce the benefits of query plan caching.</p>
</li>
<li><p><strong>Don't Overuse UDFs</strong>: While User Defined Functions (UDFs) are powerful, they can hinder Catalyst's ability to fully optimize query plans because the optimizer cannot always infer what the UDF does. Whenever possible, use built-in Spark SQL functions (a small comparison follows this list).</p>
</li>
<li><p><strong>Avoid Large Broadcasts</strong>: Broadcasting very large datasets can consume a lot of memory and network bandwidth.</p>
</li>
</ol>
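<p>A small comparison of the UDF and built-in approaches (the data here is made up):</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# A Python UDF is a black box to Catalyst: rows are serialized out to a Python
# worker, and the optimizer cannot reason about what happens inside it.
to_upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
slow = df.withColumn("name_upper", to_upper_udf("name"))

# The built-in equivalent stays in the JVM and remains fully optimizable.
fast = df.withColumn("name_upper", F.upper("name"))
</code></pre>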
]]></content:encoded></item><item><title><![CDATA[Sparks Fly]]></title><description><![CDATA[File Formats
In the realm of data storage and processing, file formats play a pivotal role in defining how information is organized, stored, and accessed.

These formats, ranging from simple text files to complex structured formats, serve as the blue...]]></description><link>https://vaishnave.page/sparks-fly</link><guid isPermaLink="true">https://vaishnave.page/sparks-fly</guid><category><![CDATA[spark]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[Parquet]]></category><category><![CDATA[avro]]></category><category><![CDATA[orc]]></category><category><![CDATA[sparksql]]></category><category><![CDATA[json]]></category><category><![CDATA[csv]]></category><category><![CDATA[serialization]]></category><category><![CDATA[deserialization]]></category><category><![CDATA[encoding]]></category><category><![CDATA[Python]]></category><category><![CDATA[PySpark]]></category><category><![CDATA[file formats]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Thu, 04 Apr 2024 17:45:20 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1712252275592/4f34c7ab-a0f7-4b32-a417-a864c5c53b0a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-file-formats">File Formats</h1>
<p>In the realm of data storage and processing, file formats play a pivotal role in defining how information is organized, stored, and accessed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711884928961/a22b6a7b-506b-4226-b8d0-4f68ab617874.png" alt class="image--center mx-auto" /></p>
<p>These formats, ranging from simple text files to complex structured formats, serve as the blueprint for data encapsulation, enabling efficient data retrieval, analysis, and interchange. Whether it's the widely used CSV and JSON for straightforward data transactions, or more specialized formats like Avro, ORC, and Parquet designed for high-performance computing environments, each file format brings unique advantages tailored to specific data storage needs and processing workflows.</p>
<h2 id="heading-comprehensive-snapshot-of-file-formats">Comprehensive Snapshot of File Formats</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1711885100580/cf0d28ec-132e-4a40-930e-47e097c694f2.png" alt class="image--center mx-auto" /></p>
<p>Here's a table summarizing the characteristics, advantages, limitations, and use cases of the mentioned file formats:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Format</strong></td><td><strong>Characteristics</strong></td><td><strong>Advantages</strong></td><td><strong>Limitations</strong></td><td><strong>Use Cases</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Comma-Separated Values (CSV)</td><td>Simple, all data as strings, line-based.</td><td>Simplicity, human readability.</td><td>Space inefficient, slow I/O for large datasets.</td><td>Small datasets, simplicity required.</td></tr>
<tr>
<td>XML/JSON</td><td>Hierarchical, complex structures, human-readable.</td><td>Suitable for complex data, human-readable.</td><td>Non-splittable, can be verbose.</td><td>Web services, APIs, configurations.</td></tr>
<tr>
<td>Avro</td><td>Row-based, schema in header, write optimized.</td><td>Best for schema evolution, write efficiency, landing zone suitability (1), predicate pushdown (2).</td><td>Slower reads for a subset of columns.</td><td>Data lakes, ETL operations, microservices architecture, schema-evolving environments.</td></tr>
<tr>
<td>Optimized Row Columnar (ORC)</td><td>Columnar, efficient reads, optimized for Hive.</td><td>Read efficiency, storage efficiency, predicate pushdown.</td><td>Inefficient writes, designed for specific use cases (Hive).</td><td>Data warehousing, analytics platforms.</td></tr>
<tr>
<td>Parquet</td><td>Columnar, efficient with nested data, self-describing, schema evolution.</td><td>Read efficiency, handles nested data, schema evolution, predicate pushdown.</td><td>Inefficient writes, complex for simple use cases.</td><td>Analytics, machine learning, complex data models.</td></tr>
</tbody>
</table>
</div><p>(1) Landing zone suitability refers to how well a data storage format or system accommodates the initial collection, storage, and basic processing of raw data in a <strong>preparatory or intermediate stage</strong> before further analytics or transformation.</p>
<p>(2) Predicate pushdown is an optimization technique that filters data as close to the source as possible, reducing the amount of data processed by pushing down the filter conditions to the storage layer. (ORC, Avro, Parquet)</p>
<h2 id="heading-comma-separated-values-csv"><strong>Comma-Separated Values (CSV)</strong></h2>
<ul>
<li><p>Despite its simplicity, CSV does not enforce any schema, leading to potential issues with data consistency and integrity. It’s extremely flexible but requires careful handling to avoid errors, such as field misalignment or varying data formats.</p>
</li>
<li><p>CSV files can be used as a lowest-common-denominator data exchange format due to their <em>wide support across programming languages, databases, and spreadsheet applications</em>.</p>
</li>
</ul>
<p><img src="https://giftoolscookbook.readthedocs.io/en/latest/_images/CSVfileEx.png" alt="5.1.2. CSV file format — GIFtoolsCookbook 1.0 documentation" class="image--center mx-auto" /></p>
<h2 id="heading-xmljson"><strong>XML/JSON</strong></h2>
<p><strong>XML</strong>:</p>
<ul>
<li><p>XML is highly extensible, allowing developers to create their tags and data structures, but this flexibility can lead to <em>increased file size and complexity</em>.</p>
</li>
<li><p>XML is often used in SOAP-based web services and as a configuration language in many software applications due to its structured and hierarchical nature.</p>
</li>
</ul>
<p><strong>JSON</strong>:</p>
<ul>
<li><p>JSON's lightweight and text-based structure make it ideal for mobile and web applications where bandwidth may be limited.</p>
</li>
<li><p>JSON has become the preferred choice for RESTful APIs due to its simplicity and ease of use with JavaScript, allowing for seamless data interchange between clients and servers.</p>
</li>
</ul>
<p><img src="https://1818760.fs1.hubspotusercontent-na1.net/hub/1818760/hubfs/Screen%20Shot%202021-12-08%20at%209.19.43%20AM.png?width=558&amp;name=Screen%20Shot%202021-12-08%20at%209.19.43%20AM.png" alt="What Is a JSON File? And What Role Does It Play in eDiscovery?" class="image--center mx-auto" /></p>
<h2 id="heading-avro-json-on-steroids"><strong>Avro:</strong> JSON on Steroids</h2>
<ul>
<li><p>Avro's support for direct mapping to and from JSON makes it easy to understand and use, especially for developers familiar with JSON. This feature <strong>enables efficient data serialization</strong> without compromising human readability during schema design.</p>
</li>
<li><p>Avro is particularly <em>favored in streaming data architectures</em>, such as Apache Kafka, where schema evolution and efficient data serialization are crucial for handling large volumes of real-time data.</p>
</li>
</ul>
<h3 id="heading-overview-of-avro-file-structure"><strong>Overview of Avro File Structure</strong></h3>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/0*INWiFWNRNOvCgDq0.png" alt="Big Data File Formats, Explained. Parquet vs ORC vs AVRO vs JSON. Which… |  by 💡Mike Shakhomirov | Towards Data Science" class="image--center mx-auto" /></p>
<p>The file structure can be broken down into two main parts: the header and the data blocks.</p>
<h4 id="heading-header">Header</h4>
<p>The header is the first part of the Avro file and contains metadata about the file, including:</p>
<ul>
<li><p><strong>Magic Bytes</strong>: A sequence of bytes (specifically, the ASCII characters "Obj" followed by the format version byte) used to identify the file as Avro format.</p>
</li>
<li><p><strong>Metadata</strong>: A map (key-value pairs) containing information such as the schema (in JSON format), codec (compression method used, if any), and any other user-defined metadata. The schema in the metadata describes the structure of the data stored in the file, defining fields, data types, and other relevant details.</p>
</li>
<li><p><strong>Sync Marker</strong>: A 16-byte, randomly-generated marker used to separate blocks in the file. The same sync marker is used throughout a single Avro file, ensuring that data blocks can be efficiently separated and processed in parallel, if necessary.</p>
</li>
</ul>
<h4 id="heading-data-blocks">Data Blocks</h4>
<p>Following the header, the file contains one or more data blocks, each holding a chunk of the actual data:</p>
<ul>
<li><p><strong>Block Size and Count</strong>: Each block starts with two integers indicating the number of serialized records in the block and the size of the serialized data (in bytes), respectively.</p>
</li>
<li><p><strong>Serialized Data</strong>: The actual data, serialized according to the Avro schema defined in the header. The serialization is binary, making it compact and efficient for storage and transmission.</p>
</li>
<li><p><strong>Sync Marker</strong>: Each block ends with the same sync marker found in the header, serving as a delimiter to mark the end of a block and facilitate efficient data processing and integrity checks.</p>
</li>
</ul>
<h3 id="heading-schema-evolution"><strong>Schema Evolution</strong></h3>
<p>Schema evolution in Avro allows for backward and forward compatibility, making it possible to modify schemas over time without disrupting existing applications. This feature is critical for long-lived data storage systems and for applications that evolve independently of the data they process.</p>
<ol>
<li><p><strong>Backward Compatibility</strong>: Older data can be read with a newer schema, enabling applications to evolve without requiring simultaneous updates to all data.</p>
</li>
<li><p><strong>Forward Compatibility</strong>: Newer data can be read with an older schema, allowing legacy systems to access updated data formats.</p>
</li>
</ol>
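<p>For reference, a hedged PySpark sketch of writing and reading Avro; it assumes the external spark-avro package (matching your Spark version) is on the classpath, and the paths are illustrative:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

# Launch with the spark-avro package on the classpath, e.g.
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.1 my_job.py
# (match the artifact version to your Spark build).
spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/data/users.json")     # illustrative source

# Write as Avro: the schema lands in the file header, as described above.
df.write.format("avro").mode("overwrite").save("/data/users_avro")

# Read it back; optional fields with defaults can be added to the schema later
# without breaking readers of older files (schema evolution).
users = spark.read.format("avro").load("/data/users_avro")
users.printSchema()
</code></pre>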
<h2 id="heading-optimized-row-columnar-orc"><strong>Optimized Row Columnar (ORC)</strong></h2>
<ul>
<li><p>ORC files include lightweight indexes that store statistics (such as min, max, sum, and count) about the data in each column, allowing for more efficient data retrieval operations. This indexing significantly enhances performance for data warehousing queries.</p>
</li>
<li><p>The format is designed to optimize the performance of Hive queries. By storing data in a columnar format, ORC minimizes the amount of data read from disk, improving query performance, especially for analytical queries that aggregate large volumes of data.</p>
</li>
</ul>
<p><img src="https://docs.cloudera.com/cdw-runtime/cloud/hive-performance-tuning/images/hive_orc_file_structure.png" alt="ORC file format" class="image--center mx-auto" /></p>
<h3 id="heading-file-structure"><strong>File Structure</strong></h3>
<p>An ORC file consists of multiple components, each serving a specific function to enhance storage and query capabilities:</p>
<ol>
<li><p><strong>File Header</strong>: This section marks the ORC file's beginning, detailing the file format version and other overarching properties.</p>
</li>
<li><p><strong>Stripes</strong>: The core data storage units within an ORC file are its stripes, substantial blocks of data usually set between 64MB and 256MB. Stripes contain a subset of the data, compressed and encoded independently, which supports both parallel processing and efficient data access. For example, in a 500MB ORC file, you might find two stripes, each carrying 250MB and consisting of 1 lakh (100,000) rows.</p>
</li>
<li><p><strong>Stripe Header</strong>: Holds metadata about the stripe, including the row count, stripe size, and specifics on the stripe's indexes and data streams.</p>
</li>
<li><p><strong>Row Groups</strong>: Data within each stripe is organized into row groups, with each grouping a set of rows to be processed collectively. Suppose each stripe is divided into 10 row groups, with each group containing 10,000 records. This structuring facilitates efficient processing by batching the columnar data of several rows.</p>
</li>
<li><p><strong>Column Storage</strong>: Within the row groups, data is stored column-wise. Different data types may leverage distinct compression and encoding strategies, optimizing storage and read operations.</p>
</li>
<li><p><strong>Indexes</strong>: Integral to ORC's efficiency are the lightweight indexes within each stripe, detailing the contained data. These indexes, including metrics like minimum and maximum values and row counts, enable precise data filtering without necessitating a full data read.</p>
</li>
<li><p><strong>Stripe Footer</strong>: Contains detailed metadata about the stripe, such as encoding types per column and stream directories.</p>
</li>
<li><p><strong>File Footer</strong>: Summarizes the file by listing stripes and indexes for all levels, detailing the data schema, and aggregating statistics like column row counts and min/max values.</p>
</li>
<li><p><strong>Postscript</strong>: Holds file metadata, including encoding and compression details, file footer length, and serialization library versions.</p>
</li>
</ol>
<h3 id="heading-indexes-and-storage-optimization"><strong>Indexes and Storage Optimization</strong></h3>
<p>Each stripe possesses a set of indexes that hold vital data statistics:</p>
<ul>
<li><p><strong>File Level</strong>: Offers global statistics for the entire file.</p>
</li>
<li><p><strong>Stripe Level</strong>: Provides specific statistics for each stripe, facilitating efficient data querying by enabling the system to skip irrelevant stripes.</p>
</li>
<li><p><strong>Row Group Index</strong>: Within each stripe, enabling finer control over data access by detailing row group contents.</p>
</li>
</ul>
<h2 id="heading-parquet"><strong>Parquet</strong></h2>
<ul>
<li><p>Parquet supports efficient compression and encoding schemes such as dictionary encoding, which can significantly reduce the size of the data stored. This feature makes it an excellent choice for cost-effective storage in cloud environments.</p>
</li>
<li><p>Parquet’s columnar storage model is not just beneficial for analytical querying; it also supports complex nested data structures, making it suitable for semi-structured data like JSON. This adaptability makes it a popular choice for big data ecosystems and data lake architectures, where data comes in various shapes and sizes.</p>
</li>
</ul>
<h3 id="heading-columnar-storage"><strong>Columnar Storage</strong></h3>
<p>Unlike traditional row-based storage, where data is stored as a sequence of rows, Parquet stores data in a columnar format. This means each column is stored separately, enabling more efficient data retrieval and compression, especially for analytical workloads where only a subset of columns may be needed for a query.</p>
<h3 id="heading-row-groups"><strong>Row Groups</strong></h3>
<p>Similar to row groups in stripes in the ORC format, a row group in Parquet is a vertical partitioning of data into columns within a certain chunk of rows. This organization allows for both columnar storage benefits and efficient data scanning across rows.</p>
<p>Let's make this a little clearer- The data is organized into columns that are further divided into row groups. Each row group contains a segment of the dataset's rows, but stores this data in a column-oriented manner.</p>
<h3 id="heading-example-scenario"><strong>Example Scenario</strong></h3>
<p>Let's take an example where data about products and countries is stored in Parquet format. In a traditional row store, a query filtering for specific product types and countries would need to scan every row. However, with Parquet's structure, the engine can directly access the columns for products and countries, and within those, only the row groups relevant to the query criteria, significantly reducing the amount of data processed. (Reference for a better understanding- <a target="_blank" href="https://data-mozart.com/parquet-file-format-everything-you-need-to-know/">Data Mozart's article on Parquet</a>)</p>
<h3 id="heading-efficiency-in-analytical-queries"><strong>Efficiency in Analytical Queries</strong></h3>
<h4 id="heading-projection-and-predicate-pushdown">Projection and Predicate Pushdown</h4>
<p>Parquet's file structure is particularly advantageous for analytical queries that involve projection (selecting specific columns) and predicates (applying filters on rows). The columnar storage allows for projection by accessing only the needed columns, while the row groups facilitate predicate pushdown by enabling selective scanning of rows based on the filters applied.</p>
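<p>As a minimal, hypothetical sketch of both optimizations in PySpark (the file and column names are made up), selecting only two columns and filtering on one of them lets the Parquet reader touch just those column chunks and skip row groups whose statistics rule them out:</p>
<pre><code class="lang-python"># Projection: only the "product" and "country" columns are read from disk
sales_df = spark.read.parquet("/data/sales.parquet").select("product", "country")

# Predicate pushdown: the filter is pushed to the Parquet reader, which can
# skip entire row groups using their min/max statistics
india_sales = sales_df.filter(sales_df.country == "India")
india_sales.explain()  # PushedFilters should appear in the physical plan
</code></pre>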
<h3 id="heading-metadata"><strong>Metadata</strong></h3>
<p>Metadata in Parquet files includes details about the data's structure, such as schema information, and statistics about the data stored within the file.</p>
<h4 id="heading-facilitating-data-skips">Facilitating Data Skips</h4>
<p>With metadata, query engines can perform optimizations such as predicate pushdown. By providing detailed information about the data's organization and content, the metadata allows query engines to quickly determine which parts of the data need to be read and which can be skipped.</p>
<h4 id="heading-reducing-io-operations">Reducing I/O Operations</h4>
<p>By minimizing the need to read irrelevant data from disk, metadata significantly reduces I/O operations, a common bottleneck in data processing. This efficiency is crucial for handling large datasets where reading the entire dataset from disk would be prohibitively time-consuming and resource-intensive.</p>
<h3 id="heading-metadata-structure"><strong>Metadata Structure</strong></h3>
<p>Every Parquet file contains a footer section, which holds the metadata. This includes format version, schema information and column metadata like statistics for individual columns within each row group.</p>
<p><img src="https://data-mozart.com/wp-content/uploads/2023/04/Format-1024x576.png" alt /></p>
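<p>If you want to peek at this footer yourself, the <code>pyarrow</code> library exposes it directly. The sketch below assumes a local file named <code>sales.parquet</code>; statistics may be absent if the writer did not record them:</p>
<pre><code class="lang-python">import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("sales.parquet")
metadata = parquet_file.metadata

print(metadata.num_row_groups, metadata.num_columns)  # overall layout
print(parquet_file.schema_arrow)                      # schema information

# Per-column statistics for the first row group (used for data skipping)
stats = metadata.row_group(0).column(0).statistics
if stats is not None:
    print(stats.min, stats.max, stats.null_count)
</code></pre>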
<h3 id="heading-encoding"><strong>Encoding</strong></h3>
<h4 id="heading-run-length-encoding-rle"><strong>Run Length Encoding (RLE)</strong></h4>
<p>Run-Length Encoding is a straightforward yet powerful compression technique. Its primary objective is to reduce the size of data that contains many consecutive occurrences of the same value (runs). Instead of storing each value individually, RLE compresses a run of data into two elements: the value itself and the count of how many times the value repeats consecutively.</p>
<ol>
<li><p><strong>Identification of Runs</strong>: RLE scans the data to identify runs, which are sequences where the same value occurs consecutively.</p>
</li>
<li><p><strong>Compression</strong>: Each run is then compressed into a pair: the value and its count. For example, the sequence <code>AAAABBBCC</code> is compressed to <code>A4B3C2</code>.</p>
</li>
<li><p><strong>Efficiency</strong>: The efficiency of RLE is highly dependent on the data's nature. It performs exceptionally well on data with many long runs but might not be effective for data with high variability.</p>
</li>
</ol>
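<p>A toy implementation helps make the idea concrete. This is just an illustrative sketch, not how Parquet implements RLE internally:</p>
<pre><code class="lang-python">def run_length_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)  # extend the current run
        else:
            encoded.append((value, 1))                 # start a new run
    return encoded

print(run_length_encode("AAAABBBCC"))
# [('A', 4), ('B', 3), ('C', 2)]  -- i.e. A4B3C2
</code></pre>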
<h4 id="heading-run-length-encoding-with-bit-packing"><strong>Run Length Encoding with Bit Packing</strong></h4>
<p>Combining RLE with bit-packing enhances its compression capability, especially for numerical data with a limited range of values. Bit-packing is a technique that minimizes the number of bits needed to represent a value, packing multiple numbers into a single byte or word in memory.</p>
<p>Let me expand on bit-packing a little more to make it easier.</p>
<p>A byte consists of 8 bits, and traditionally, a whole byte might be used to store a single piece of information. But if your information can be represented with fewer bits, you can store multiple pieces of information in the same byte.</p>
<p>For example, for integers ranging from 0 to 15, you only need 4 bits to represent all possible values. Why? Because 4 bits can create 16 different combinations (2^4). Here’s how the bit representation looks for these numbers:</p>
<ul>
<li><p>The number 0 is <code>0000</code> in binary.</p>
</li>
<li><p>The number 1 is <code>0001</code>.</p>
</li>
<li><p>...</p>
</li>
<li><p>The number 15 is <code>1111</code>.</p>
</li>
</ul>
<p>Each of these can fit in just half a byte (or 4 bits).</p>
<p>Now, instead of using a whole byte (8 bits) for each number, you decide to pack two numbers into one byte.</p>
<p>Let's consider you have the numbers 3 (<code>0011</code>) and 14 (<code>1110</code>) to store:</p>
<ol>
<li><p>Convert each number to its 4-bit binary representation: <code>0011</code> for 3 and <code>1110</code> for 14.</p>
</li>
<li><p>Pack these two sets of 4 bits into one byte: <code>00111110</code>.</p>
</li>
</ol>
<p>If a value or run length exceeds what fits in the chosen bit width (anything above 15 in the 4-bit example), you either split it into multiple runs or store an escape flag for just those entries, followed by the exact count.</p>
<p>Now, when RLE is combined with bit-packing, it efficiently compresses runs of integers by first using RLE to encode the run lengths and then applying bit-packing to compress the actual values. This dual approach is particularly effective for numerical data with small ranges or repetitive sequences.</p>
<ol>
<li><p><strong>RLE on Runs</strong>: Identifies and encodes the length of runs.</p>
</li>
<li><p><strong>Bit-Packing on Values</strong>: Applies bit-packing to the values themselves, further reducing the storage space required.</p>
</li>
<li><p><strong>Use Case Example</strong>: A sequence of integers with many repeats and a small range, such as <code>[0, 0, 0, 1, 1, 2, 2, 2, 2]</code>, would first have RLE applied to compress the repeats, then bit-packing could compress the actual numbers since they require very few bits to represent.</p>
</li>
</ol>
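<p>The following sketch shows both ideas in plain Python: packing two 4-bit values into one byte (as in the 3 and 14 example), and then applying RLE followed by a bit-width calculation to the small-range sequence from the use case above. Real formats use more elaborate layouts, so treat this purely as an illustration:</p>
<pre><code class="lang-python"># Packing two 4-bit numbers into a single byte
a, b = 3, 14                      # 0011 and 1110
packed = (a &lt;&lt; 4) | b             # 00111110
print(format(packed, "08b"))      # '00111110'

# RLE on [0, 0, 0, 1, 1, 2, 2, 2, 2]
values = [0, 0, 0, 1, 1, 2, 2, 2, 2]
runs = []
for v in values:
    if runs and runs[-1][0] == v:
        runs[-1][1] += 1
    else:
        runs.append([v, 1])
print(runs)                       # [[0, 3], [1, 2], [2, 4]]

# The distinct values (0, 1, 2) need only 2 bits each, so four of them
# could share one byte; here we just compute the required bit width
bits_needed = max(values).bit_length()
print(bits_needed)                # 2
</code></pre>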
<h4 id="heading-dictionary-encoding"><strong>Dictionary Encoding</strong></h4>
<p>Dictionary encoding is a compression technique where a dictionary is created for a dataset, mapping each distinct value to a unique integer identifier. This approach is highly effective for columns with many rows but relatively few distinct values, where the same values repeat often.</p>
<ol>
<li><p><strong>Dictionary Creation</strong>: Scan the dataset to identify all unique values. Each unique value is assigned a unique integer identifier. This mapping of values to identifiers forms the dictionary.</p>
</li>
<li><p><strong>Data Transformation</strong>: Replace each value in the dataset with its corresponding identifier from the dictionary. This results in a transformed dataset where repetitive values are replaced with their integer identifiers, often much smaller in size than the original values.</p>
</li>
<li><p><strong>Storage</strong>: The dictionary and the transformed dataset are stored together. To reconstruct the original data, a lookup in the dictionary is performed using the integer identifiers.</p>
</li>
</ol>
<p><strong>Example:</strong> Consider a column with the values ["Red", "Blue", "Red", "Green", "Blue"]. A dictionary might map these colors to integers as follows: {"Red": 1, "Blue": 2, "Green": 3}. The encoded column becomes [1, 2, 1, 3, 2].</p>
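<p>A few lines of Python capture the whole process for the colour example above (again, an illustrative sketch rather than Parquet's internal implementation):</p>
<pre><code class="lang-python">column = ["Red", "Blue", "Red", "Green", "Blue"]

# 1. Dictionary creation: map each distinct value to an integer identifier
dictionary = {}
for value in column:
    if value not in dictionary:
        dictionary[value] = len(dictionary) + 1

# 2. Data transformation: replace values with their identifiers
encoded = [dictionary[value] for value in column]

print(dictionary)  # {'Red': 1, 'Blue': 2, 'Green': 3}
print(encoded)     # [1, 2, 1, 3, 2]

# 3. Decoding: reverse the dictionary to reconstruct the original column
reverse = {v: k for k, v in dictionary.items()}
print([reverse[i] for i in encoded])  # ['Red', 'Blue', 'Red', 'Green', 'Blue']
</code></pre>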
<h4 id="heading-dictionary-encoding-with-bit-packing"><strong>Dictionary Encoding with Bit Packing</strong></h4>
<p>If the range of unique integers is small, you can further compress this data with bit-packing.</p>
<ol>
<li><p><strong>Apply Dictionary Encoding</strong>: First, create a dictionary for the column's unique values, assigning each an integer identifier. Replace each original value in the column with its corresponding identifier.</p>
</li>
<li><p><strong>Assess the Dictionary Size</strong>: Determine the maximum identifier value used in your dictionary. This tells you the range of values you need to encode.</p>
</li>
<li><p><strong>Determine Bit Requirements</strong>: Calculate the minimum number of bits required to represent the highest identifier. For example, if your highest identifier is 15, you need 4 bits.</p>
</li>
<li><p><strong>Bit-Pack the Identifiers</strong>: Instead of storing these identifiers using standard byte or integer formats, you pack them into the minimum bit spaces determined. This might mean putting two 4-bit identifiers into a single 8-bit byte, effectively doubling the storage efficiency for these values. (Similar to what we went over in the 3 (<code>0011</code>) and 14 (<code>1110</code>) example above)</p>
</li>
</ol>
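<p>Continuing the illustrative sketch above, the bit width needed for the identifiers follows directly from the dictionary size:</p>
<pre><code class="lang-python">max_id = max(dictionary.values())      # 3 in the colour example
bits_needed = max_id.bit_length()      # 2 bits cover identifiers 1-3
print(bits_needed)

# With 2 bits per identifier, four identifiers fit into a single byte,
# compared with at least one full byte each in a naive representation
</code></pre>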
<h1 id="heading-data-serialization">Data Serialization</h1>
<p>Having explored file formats, our focus shifts to the critical aspect of how we mobilize this data. This is where serialization comes in- it acts as the conduit or channel, transforming data structures and objects into a streamlined sequence of bytes that can be easily stored, transmitted, and subsequently reassembled.</p>
<p>Spark uses serialization to:</p>
<ul>
<li><p>Transfer data across the network between different nodes in a cluster.</p>
</li>
<li><p>Save data to disk when necessary, such as persisting RDDs (which will be discussed in the future articles) or when spilling data from memory to disk during shuffles.</p>
</li>
</ul>
<h2 id="heading-serialization-libraries">Serialization Libraries</h2>
<p>Spark supports two main serialization libraries:</p>
<h3 id="heading-java-serialization-library"><strong>Java Serialization Library</strong></h3>
<ul>
<li><p><strong>Default Method</strong>: The default serialization method used by Spark is the native Java serialization facilitated by implementing the <code>java.io.Serializable</code> interface.</p>
</li>
<li><p><strong>Ease of Use</strong>: This method is straightforward to use since it requires minimal code changes, but it can be relatively slow and produce larger serialized objects, which may not be ideal for network transmission or disk storage.</p>
</li>
</ul>
<h3 id="heading-kryo-serialization-library"><strong>Kryo Serialization Library</strong></h3>
<ul>
<li><p><strong>Performance-Optimized</strong>: Kryo is a faster and more compact serialization framework compared to Java serialization. It can significantly reduce the amount of data that needs to be transferred over the network or stored.</p>
</li>
<li><p><strong>Configuration Required</strong>: To use Kryo serialization, it must be explicitly enabled and configured in Spark. This involves registering the classes you'll serialize with Kryo, which can be a bit more work but is often worth the effort for the performance gains.</p>
</li>
</ul>
<h2 id="heading-implementation-and-use-cases"><strong>Implementation and Use Cases</strong></h2>
<ul>
<li><p><strong>Configuring Spark for Serialization</strong>: In Spark, you can configure the serialization library at the start of your application through the Spark configuration settings, using the <code>spark.serializer</code> property. For Kryo, you would set this to <code>org.apache.spark.serializer.KryoSerializer</code> (see the sketch after this list).</p>
</li>
<li><p><strong>Serialization for Performance</strong>: Choosing the right serialization library and tuning your configuration can have a significant impact on the performance of your Spark applications, especially in data-intensive operations. Efficient serialization reduces the overhead of shuffling data between Spark workers and nodes, and when persisting data.</p>
</li>
</ul>
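<p>As a rough sketch, the configuration described above can be applied when building the session; the class names registered here are hypothetical placeholders:</p>
<pre><code class="lang-python">from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Kryo Example")
    # Switch from Java serialization to Kryo
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optionally list JVM classes to register with Kryo up front
    .config("spark.kryo.classesToRegister", "com.example.SaleRecord,com.example.Customer")
    .getOrCreate()
)
</code></pre>
<p>Note that in PySpark the Python-side data is serialized by separate Python serializers; the Kryo setting mainly affects the JVM side of the job (shuffles, caching of JVM objects), so its impact is most visible in Scala and Java workloads.</p>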
<h2 id="heading-best-practices"><strong>Best Practices</strong></h2>
<ul>
<li><p><strong>Use Kryo for Large Datasets</strong>: For applications that process large volumes of data or require extensive network communication, using Kryo serialization is recommended.</p>
</li>
<li><p><strong>Register Classes with Kryo</strong>: When using Kryo, explicitly registering your classes can improve performance further, as it allows Kryo to optimize serialization and deserialization operations.</p>
</li>
<li><p><strong>Tune Spark’s Memory Management</strong>: Properly configuring memory management in Spark, alongside efficient serialization, can reduce the occurrence of spills to disk and out-of-memory errors, leading to smoother and faster execution of Spark jobs. We'll discuss more on memory management in the upcoming articles.</p>
</li>
</ul>
<p>Whether encoding data in JSON for web transport, using Avro for its compact binary efficiency, or adopting Parquet's columnar approach for analytic processing, the method of serialization chosen is pivotal. It not only dictates the ease and efficiency of data movement but also shapes how data is preserved, accessed, and utilized across various platforms and applications.</p>
<h1 id="heading-deserialization-putting-the-puzzle-back-together"><strong>Deserialization: Putting the puzzle back together</strong></h1>
<p>After detailing the serialization process, it's essential to understand its counterpart: deserialization. Deserialization reverses serialization's actions, converting the sequence of bytes back into the original data structure or object. This process is crucial for retrieving serialized data from storage or after transmission, allowing it to be utilized effectively within applications.</p>
<h2 id="heading-sparks-approach-to-deserialization">Spark's Approach to Deserialization</h2>
<p>In Apache Spark, deserialization plays a vital role in:</p>
<ul>
<li><p><strong>Reading Data from Disk</strong>: When data persisted in storage (like RDDs or datasets) needs to be accessed for computation, Spark deserializes it into the original format for processing.</p>
</li>
<li><p><strong>Receiving Data Across the Network</strong>: Data sent between nodes in a cluster is deserialized upon receipt to be used in computations or stored efficiently.</p>
</li>
</ul>
<h2 id="heading-deserialization-libraries-in-spark">Deserialization Libraries in Spark</h2>
<p>As with serialization, Spark leverages the same libraries for deserialization:</p>
<ul>
<li><p><strong>Java Deserialization</strong>: By default, Spark uses Java’s built-in deserialization mechanism. This method is automatically applied to data serialized with Java serialization, ensuring compatibility and ease of use but may carry performance and size overheads.</p>
</li>
<li><p><strong>Kryo Deserialization</strong>: For data serialized with Kryo, Spark employs Kryo’s deserialization methods, which are designed for high performance and efficiency. Similar to serialization, proper configuration and class registration are essential for leveraging Kryo's full potential in deserialization processes.</p>
</li>
</ul>
<h2 id="heading-implementing-deserialization-in-spark-applications">Implementing Deserialization in Spark Applications</h2>
<ul>
<li><p><strong>Configuration for Deserialization</strong>: Similar to serialization, deserialization settings in Spark are configured at the application's outset via Spark configuration. <em>Ensuring that the deserialization library matches the serialization method used is crucial for correct data retrieval.</em></p>
</li>
<li><p><strong>Deserialization for Data Recovery and Processing</strong>: Effective deserialization is key for the rapid recovery of persisted data and the efficient execution of distributed computations, <em>impacting overall application performance</em>.</p>
</li>
</ul>
<h2 id="heading-best-practices-for-deserialization"><strong>Best Practices for Deserialization</strong></h2>
<ul>
<li><p><strong>Match Serialization and Deserialization Libraries</strong>: Ensure consistency between serialization and deserialization methods to avoid errors and data loss.</p>
</li>
<li><p><strong>Optimize Data Structures for Serialization/Deserialization</strong>: Design data structures with serialization efficiency in mind, particularly for applications with heavy data exchange or persistence requirements.</p>
</li>
<li><p><strong>Monitor Performance Impacts</strong>: Regularly assess the impact of serialization and deserialization on application performance, adjusting configurations as necessary to optimize speed and resource usage.</p>
</li>
</ul>
<h2 id="heading-serialization-between-memory-and-disk"><strong>Serialization between Memory and Disk</strong></h2>
<ol>
<li><p><strong>Serialization</strong>: Before writing data to disk, Spark serializes the data into a binary format. This is done to minimize the memory footprint and optimize disk I/O.</p>
</li>
<li><p><strong>Spilling</strong>: When the memory usage exceeds the available memory threshold, Spark spills data to disk to free up memory. This spilled data, in its serialized form, is written to disk partitions, typically in <em>temporary spill files</em>.</p>
</li>
<li><p><strong>Deserialization</strong>: When the spilled data needs to be read back into memory for further processing, Spark deserializes the data from its binary format back into objects.</p>
</li>
</ol>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">However, it's important to note that serialization and deserialization incur some computational cost, so Spark aims to minimize these operations and optimize their performance where possible.</div>
</div>
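<p>To connect this to everyday Spark code, choosing a storage level that allows spilling is one place where this serialize-spill-deserialize cycle shows up. A minimal sketch, with a hypothetical dataset:</p>
<pre><code class="lang-python">from pyspark import StorageLevel

events_df = spark.read.parquet("/data/events.parquet")

# MEMORY_AND_DISK keeps partitions in memory while they fit and spills the
# rest to disk in serialized form; spilled partitions are deserialized when read back
events_df.persist(StorageLevel.MEMORY_AND_DISK)

events_df.count()   # materializes (and caches) the data
events_df.unpersist()
</code></pre>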

<p>The interplay between serialization and deserialization is a foundational aspect of data processing in distributed systems like Apache Spark- while serialization ensures that data is efficiently prepared for transfer or storage, deserialization makes that data usable again, completing the cycle of data mobility. As we continue exploring Spark's advanced capabilities and optimizations in future articles, the importance of efficient serialization and deserialization practices in enhancing performance and enabling complex computations over distributed data will become increasingly apparent.</p>
]]></content:encoded></item><item><title><![CDATA[Spark Illuminated]]></title><description><![CDATA[Dataframes vs Resilient Distributed Datasets
In the first article, I explained Dataframes in detail, and in the second, I talked about Resilient Distributed Datasets. But what exactly sets them apart?
Here's a table that summarizes the key difference...]]></description><link>https://vaishnave.page/spark-illuminated</link><guid isPermaLink="true">https://vaishnave.page/spark-illuminated</guid><category><![CDATA[spark]]></category><category><![CDATA[sparksql]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[#rdd]]></category><category><![CDATA[RDD (Resilient Distributed Dataset)]]></category><category><![CDATA[dataframe]]></category><category><![CDATA[Python]]></category><category><![CDATA[Scala]]></category><category><![CDATA[joins]]></category><category><![CDATA[hadoop]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Wed, 27 Mar 2024 19:05:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711556032844/7cd6f242-cd4b-4dc7-883f-631b06a1d18e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-dataframes-vs-resilient-distributed-datasets"><strong>Dataframes vs Resilient Distributed Datasets</strong></h1>
<p>In the first article, <a target="_blank" href="https://vaishnave.page/dabbling-with-spark-essentials#heading-understanding-dataframes-and-catalogs">I explained Dataframes in detail</a>, and in the second, <a target="_blank" href="https://vaishnave.page/dabbling-with-spark-essentials-again#heading-resilient-distributed-datasets-under-the-hood">I talked about Resilient Distributed Datasets</a>. But what exactly sets them apart?</p>
<p>Here's a table that summarizes the key differences between RDDs and DataFrames in Apache Spark, along with examples for each:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Feature</strong></td><td><strong>RDDs</strong></td><td><strong>DataFrames</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Abstraction Level</strong></td><td>Low-level, providing fine-grained control over data and operations.</td><td>High-level, providing a more structured and higher-level API.</td></tr>
<tr>
<td><strong>Optimization</strong></td><td>No automatic optimization. Operations are executed as they are defined.</td><td>Uses the Catalyst optimizer for logical and physical plan optimizations.</td></tr>
<tr>
<td><strong>Ease of Use</strong></td><td>Requires more detailed and complex syntax for operations.</td><td>Simplified and more intuitive operations similar to SQL.</td></tr>
<tr>
<td><strong>Interoperability</strong></td><td>Primarily functional programming interfaces in Python, Scala, and Java.</td><td>Supports SQL queries, Python, Scala, Java, and R APIs.</td></tr>
<tr>
<td><strong>Typed and Untyped APIs</strong></td><td>Provides a type-safe API (especially in Scala and Java).</td><td>Offers mostly untyped APIs, except when using Datasets in Scala and Java.</td></tr>
<tr>
<td><strong>Performance</strong></td><td>May require manual optimization, like coalescing partitions. *</td><td>Generally faster due to Catalyst optimizer and Tungsten execution engine. *</td></tr>
<tr>
<td><strong>Fault Tolerance</strong></td><td>Achieved through lineage information allowing lost data to be recomputed.</td><td>Also fault-tolerant through Spark SQL's execution engine.</td></tr>
<tr>
<td><strong>Use Cases</strong></td><td>Suitable for complex computations, fine-grained transformations, and when working with unstructured data.</td><td>Best for standard data processing tasks on structured or semi-structured data, benefiting from built-in optimization.</td></tr>
</tbody>
</table>
</div><p>Note:</p>
<p>* Will be covered in future articles</p>
<h3 id="heading-examples-in-context"><strong>Examples in Context</strong></h3>
<ul>
<li><p><strong>RDD Example</strong>: Suppose you have a list of sales transactions and you want to filter out transactions with a value greater than 100.</p>
<pre><code class="lang-python">  transactions_rdd = sc.parallelize([(<span class="hljs-number">1</span>, <span class="hljs-number">150</span>), (<span class="hljs-number">2</span>, <span class="hljs-number">75</span>), (<span class="hljs-number">3</span>, <span class="hljs-number">200</span>)])
  filtered_rdd = transactions_rdd.filter(<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>] &gt; <span class="hljs-number">100</span>)
</code></pre>
</li>
<li><p><strong>DataFrame Example</strong>: Achieving the same with a DataFrame:</p>
<pre><code class="lang-python">  transactions_df = spark.createDataFrame([(<span class="hljs-number">1</span>, <span class="hljs-number">150</span>), (<span class="hljs-number">2</span>, <span class="hljs-number">75</span>), (<span class="hljs-number">3</span>, <span class="hljs-number">200</span>)], [<span class="hljs-string">"id"</span>, <span class="hljs-string">"amount"</span>])
  filtered_df = transactions_df.filter(<span class="hljs-string">"amount &gt; 100"</span>)
</code></pre>
</li>
</ul>
<p>The choice between RDDs and DataFrames depends on the specific needs of your application, such as the need for custom, complex transformations (favouring RDDs) or for optimized, structured data processing (favouring DataFrames).</p>
<h1 id="heading-understanding-joins-in-apache-spark"><strong>Understanding Joins in Apache Spark</strong></h1>
<p>Let's break down the concept of joins in Apache Spark, starting with the basics and moving towards how to optimize these operations for efficiency.</p>
<h2 id="heading-introduction-to-joins"><strong>Introduction to Joins</strong></h2>
<p>If you've spent some time in the data domain, you might already be familiar with Joins. These allow us to merge datasets based on common attributes.</p>
<p>Spark supports several types of joins to cater to various needs:</p>
<ul>
<li><p>Inner Join: Retains only the row pairs with matching IDs from both tables.</p>
</li>
<li><p>Left Join: Retains all rows from the left table, along with their matches from the right table.</p>
</li>
<li><p>Full Outer Join: Merges the outcomes of both left and right joins, keeping all rows from both tables.</p>
</li>
<li><p>Cross Join: Generates a Cartesian product, pairing every row from the first table with every row from the second table.</p>
</li>
</ul>
<p>I won't be spending too much time on this topic (SQL basics) and instead will move into how Spark deals with Joins.</p>
<h2 id="heading-how-joins-work-in-spark"><strong>How Joins Work in Spark</strong></h2>
<p>To perform these joins, Spark may need to shuffle data. This shuffling, a mix of network and sort operations, aligns matching rows for the join but can be resource-intensive. I will go into detail on Shuffling in further articles on Spark Optimisation. But, for now, I want you to keep in mind that:</p>
<ul>
<li><p><strong>Shuffle Process</strong>: Spark redistributes data across its distributed system to align matching rows, ensuring all data with the same key (ID) are in the same partition for joining.</p>
</li>
<li><p><strong>Impact on Performance</strong>: Shuffling involves moving data across the network, which can be a bottleneck. Hence, optimizing this process is crucial for performance.</p>
</li>
</ul>
<p>What if, instead of optimizing data transfer between nodes (Node-to-node communication strategy), we focus on maximizing the efficiency of data processing within each individual node by bringing the data into each node (per node computation strategy)?</p>
<h2 id="heading-broadcast-joins">Broadcast Joins</h2>
<p>A broadcast join in Apache Spark is an optimization technique for join operations that can significantly improve performance, especially when one of the datasets involved in the join is much smaller than the other. To understand broadcast joins deeply, let’s explore their mechanism, advantages, and when to use them, using an analogy and then delving into the technical details.</p>
<h3 id="heading-the-broadcast-join-explained"><strong>The Broadcast Join Explained</strong></h3>
<p>Imagine you’re a teacher in a large school, and you have a small list of special announcements that you need to share with every classroom. Instead of asking every classroom to come to you (which would be chaotic and inefficient), you make copies of the announcements and deliver them to each classroom. Now, every classroom has direct access to the announcements, making the whole process smoother and faster.</p>
<p>In Apache Spark, a similar approach is taken with broadcast joins. Instead of moving large amounts of data across the network to perform a join, the smaller dataset is sent (broadcast) to every node in the cluster where the larger dataset resides. This eliminates the need for shuffling the larger dataset across the network, significantly reducing the amount of data transfer and, consequently, the time taken to perform the join.</p>
<h3 id="heading-in-depth-mechanics-of-broadcast-joins"><strong>In-depth Mechanics of Broadcast Joins</strong></h3>
<ol>
<li><p><strong>Identification</strong>: Spark automatically identifies when one side of the join operation is small enough to be broadcasted based on the <code>spark.sql.autoBroadcastJoinThreshold</code> configuration setting. This setting specifies the maximum size (in bytes) of a table that can be broadcast.</p>
</li>
<li><p><strong>Broadcasting</strong>: The smaller dataset is broadcasted to all executor nodes in the Spark cluster. This involves serializing the dataset and sending it over the network to each node.</p>
</li>
<li><p><strong>Local Join Operation</strong>: Once the smaller dataset is on every node, Spark performs the join operation locally, combining it with the relevant partition of the larger dataset that resides on the node. Since the data is located together, this operation is very efficient and avoids the costly shuffle operation that would otherwise be necessary.</p>
</li>
</ol>
<p>To illustrate the implementation of a broadcast join in Apache Spark using PySpark (the Python API for Spark), let's consider a simple example. Suppose we have two datasets: <strong>a large dataset</strong> <code>employees</code> with employee details, and <strong>a smaller dataset</strong> <code>departments</code> that contains department names and their IDs. Our goal is to use a broadcast join to efficiently join these two datasets on the department ID, ensuring that each employee record is enriched with the department name.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> broadcast

<span class="hljs-comment"># Initialize a SparkSession</span>
spark = SparkSession.builder \
    .appName(<span class="hljs-string">"Broadcast Join Example"</span>) \
    .getOrCreate()

<span class="hljs-comment"># Example datasets</span>
employees_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"John Doe"</span>, <span class="hljs-number">2</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Jane Doe"</span>, <span class="hljs-number">1</span>),
    (<span class="hljs-number">3</span>, <span class="hljs-string">"Mike Brown"</span>, <span class="hljs-number">2</span>),
    <span class="hljs-comment"># .... Assume more records</span>
]

departments_data = [
    (<span class="hljs-number">1</span>, <span class="hljs-string">"Human Resources"</span>),
    (<span class="hljs-number">2</span>, <span class="hljs-string">"Technology"</span>),
    <span class="hljs-comment"># .... Assume more departments</span>
]

<span class="hljs-comment"># Creating DataFrames</span>
employees_df = spark.createDataFrame(employees_data, [<span class="hljs-string">"emp_id"</span>, <span class="hljs-string">"name"</span>, <span class="hljs-string">"dept_id"</span>])
departments_df = spark.createDataFrame(departments_data, [<span class="hljs-string">"dept_id"</span>, <span class="hljs-string">"dept_name"</span>])

<span class="hljs-comment"># Performing a broadcast join (inner)</span>
<span class="hljs-comment"># Broadcasting the smaller DataFrame (departments_df)</span>
joined_df = employees_df.join(broadcast(departments_df), employees_df.dept_id == departments_df.dept_id, <span class="hljs-string">"inner"</span>)

<span class="hljs-comment"># Show the result of the join</span>
joined_df.show()

<span class="hljs-comment"># Stop the SparkSession</span>
spark.stop()
</code></pre>
<h3 id="heading-advantages-of-broadcast-joins"><strong>Advantages of Broadcast Joins</strong></h3>
<ul>
<li><p><strong>Performance</strong>: The primary advantage is the significant reduction in network traffic, which can greatly improve the performance of join operations.</p>
</li>
<li><p><strong>Efficiency</strong>: By avoiding shuffling, broadcast joins reduce the load on the network and the cluster, making the join process more efficient.</p>
</li>
<li><p><strong>Scalability</strong>: For joins with significantly disparate dataset sizes, broadcast joins allow Spark to scale more effectively, handling large datasets more efficiently by leveraging the parallel processing capabilities of the cluster.</p>
</li>
</ul>
<h3 id="heading-use-cases"><strong>Use Cases</strong></h3>
<p>Broadcast joins are particularly effective in scenarios where:</p>
<ul>
<li><p>One dataset is much smaller than the other and can fit into the memory of each node.</p>
</li>
<li><p>The smaller dataset is used repeatedly in join operations with different larger datasets, making the cost of broadcasting it justified over several operations.</p>
</li>
</ul>
<h3 id="heading-guidelines-for-use"><strong>Guidelines for Use</strong></h3>
<ul>
<li><p><strong>Size Consideration</strong>: Be mindful of the <code>spark.sql.autoBroadcastJoinThreshold</code> setting. If the smaller dataset sits just above this size limit, consider broadcasting it manually when the join would benefit (see the snippet after this list).</p>
</li>
<li><p><strong>Memory Management</strong>: Ensure that there is sufficient memory on each node to accommodate the broadcasted data without impacting the execution of other tasks.</p>
</li>
</ul>
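<p>The broadcast threshold mentioned above is an ordinary Spark SQL configuration, so adjusting it, or disabling it in favour of explicit hints, looks roughly like this (the 50 MB value is arbitrary):</p>
<pre><code class="lang-python"># Raise the automatic broadcast threshold to roughly 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Or disable automatic broadcasting entirely and rely on explicit hints
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
joined_df = employees_df.join(broadcast(departments_df), "dept_id")
</code></pre>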
<p>In summary, broadcast joins are a powerful optimization in Apache Spark that can lead to significant performance improvements in distributed data processing tasks, particularly when the size disparity between joining datasets is large. Understanding when and how to use broadcast joins can greatly enhance the efficiency of Spark applications.</p>
<h2 id="heading-other-advanced-joins">Other Advanced Joins</h2>
<p>Advanced join techniques in Apache Spark are designed to optimize the performance of join operations beyond the basic and broadcast join strategies. These techniques can significantly reduce the computational and memory overhead associated with joins, especially when dealing with large datasets. Let’s dive into some of these advanced techniques:</p>
<h3 id="heading-sort-merge-join"><strong>Sort-Merge Join</strong></h3>
<p>A sort-merge join is a method that Spark may choose when both sides of the join have already been sorted on the join key. This technique involves two main steps:</p>
<ul>
<li><p><strong>Sorting</strong>: If the datasets are not already sorted, they are sorted by the join key.</p>
</li>
<li><p><strong>Merging</strong>: The sorted datasets are then merged together. Because the data is sorted, Spark can efficiently merge the two datasets by sequentially scanning through them, significantly <em>reducing the need for shuffling data across the network</em>.</p>
</li>
</ul>
<p><strong>Use Cases</strong>: Sort-merge joins are particularly effective when dealing with large datasets that are already sorted or when the cost of sorting the data is offset by the reduced need for shuffling during the join.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Assuming df1 and df2 are the DataFrames to join</span>
df1_sorted = df1.sort(<span class="hljs-string">"joinKey"</span>)
df2_sorted = df2.sort(<span class="hljs-string">"joinKey"</span>)

<span class="hljs-comment"># Perform the join</span>
joined_df = df1_sorted.join(df2_sorted, df1_sorted.joinKey == df2_sorted.joinKey)
</code></pre>
<h3 id="heading-shuffle-hash-join"><strong>Shuffle Hash Join</strong></h3>
<p>The shuffle hash join is another technique that Spark might use, especially when one dataset is much larger than the other but not small enough for a broadcast join. This method works by:</p>
<ul>
<li><p><strong>Shuffling</strong>: The smaller dataset is shuffled (partitioned) across the cluster based on the join key.</p>
</li>
<li><p><strong>Hashing</strong>: Spark then builds an in-memory hash table of the smaller dataset on each partition.</p>
</li>
<li><p><strong>Joining</strong>: The larger dataset is scanned partition-wise, and for each row, Spark uses the hash table to quickly find matching rows from the smaller dataset.</p>
</li>
</ul>
<p><strong>Use Cases</strong>: This method is efficient when one dataset is moderately small, and its hash table can fit into the memory of each node, allowing for fast lookups.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Setting these parametersis is more about influencing Spark's </span>
<span class="hljs-comment"># choice rather than directly enforcing a Shuffle Hash Join.</span>
spark.conf.set(<span class="hljs-string">"spark.sql.join.preferSortMergeJoin"</span>, <span class="hljs-string">"false"</span>)
spark.conf.set(<span class="hljs-string">"spark.sql.autoBroadcastJoinThreshold"</span>, <span class="hljs-string">"-1"</span>)

<span class="hljs-comment"># Perform the join, Spark might use Shuffle Hash Join depending </span>
<span class="hljs-comment"># on data size and other conditions</span>
joined_df = df1.join(df2, [<span class="hljs-string">"joinKey"</span>])
</code></pre>
<h3 id="heading-bucketed-joins"><strong>Bucketed Joins</strong></h3>
<p>Bucketed joins in Spark leverage bucketing, a technique where data is divided based on a join key:</p>
<ul>
<li><p><strong>Bucketing</strong>: Data is split into buckets based on a hash of the join key, with each bucket containing a subset of data. This is done during data ingestion. To understand this better, <a target="_blank" href="https://vaishnave.page/from-my-way-to-the-hive-way#heading-bucketing">refer to my previous article where I discuss bucketing in detail.</a></p>
</li>
<li><p><strong>Joining</strong>: When performing a join on bucketed tables, Spark can <em>avoid shuffling by directly joining buckets with matching keys</em>.</p>
</li>
</ul>
<p><strong>Use Cases</strong>: Bucketed joins are most effective when dealing with frequent joins on large datasets where the join keys are known ahead of time. By avoiding shuffling, they can significantly improve join performance.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Bucketed Joins require your data to be bucketed and saved </span>
<span class="hljs-comment"># to disk beforehand</span>
df1.write.bucketBy(<span class="hljs-number">42</span>, <span class="hljs-string">"joinKey"</span>).sortBy(<span class="hljs-string">"joinKey"</span>).saveAsTable(<span class="hljs-string">"df1_bucketed"</span>)
df2.write.bucketBy(<span class="hljs-number">42</span>, <span class="hljs-string">"joinKey"</span>).sortBy(<span class="hljs-string">"joinKey"</span>).saveAsTable(<span class="hljs-string">"df2_bucketed"</span>)

<span class="hljs-comment"># Read bucketed data</span>
df1_bucketed = spark.table(<span class="hljs-string">"df1_bucketed"</span>)
df2_bucketed = spark.table(<span class="hljs-string">"df2_bucketed"</span>)

<span class="hljs-comment"># Perform the bucketed join without shuffling</span>
joined_bucketed_df = df1_bucketed.join(df2_bucketed, <span class="hljs-string">"joinKey"</span>)
</code></pre>
<p>Here, <code>42</code> specifies the number of buckets to create, and <code>"joinKey"</code> is the column used for bucketing. This means that rows with the same value in the <code>joinKey</code> column will end up in the same bucket.</p>
<h3 id="heading-skew-join-optimization"><strong>Skew Join Optimization</strong></h3>
<p>Data skew occurs when the distribution of values within a dataset is uneven, leading to some tasks taking much longer than others. Spark offers optimizations for skew joins:</p>
<ul>
<li><p><strong>Detecting and Handling Skew</strong>: Spark can detect skewed keys and replicate the skewed partition's data to all tasks, allowing each task to process a portion of the skewed data in parallel.</p>
</li>
<li><p><strong>Salting</strong>: A common technique to manually handle skew involves "salting" the join keys by adding a random value, thus distributing the skewed keys across multiple partitions to balance the load. This process creates multiple variations of each key, distributing the data more evenly across the cluster.</p>
</li>
</ul>
<p><strong>Use Cases</strong>: Skew join optimizations are crucial for performance when joins involve skewed datasets, helping to distribute the load more evenly across the cluster.</p>
<p>The following steps describe how you can perform the Skew Join:</p>
<ol>
<li><strong>Adding Salt to Join Keys</strong></li>
</ol>
<pre><code class="lang-python">df1 = df1.withColumn(<span class="hljs-string">"salt"</span>, (rand()*<span class="hljs-number">5</span>).cast(<span class="hljs-string">"int"</span>))
df2 = df2.withColumn(<span class="hljs-string">"salt"</span>, monotonically_increasing_id() % <span class="hljs-number">5</span>)
</code></pre>
<p><strong>For</strong><code>df1</code>: A new column named <code>salt</code> is added. This column is populated with random integers between 0 and 4, inclusive. The <code>rand()</code> function generates a random floating-point number between 0 and 1, which is then multiplied by 5 to scale it. Casting it to an integer effectively creates a random salt value within the desired range.</p>
<p><strong>For</strong><code>df2</code>: Similarly, a <code>salt</code> column is added. Here, <code>monotonically_increasing_id()</code> generates a unique, monotonically increasing integer for each row. Taking this value modulo 5 ensures that the salt values are distributed between 0 and 4, much like in <code>df1</code>, but in a sequential and deterministic manner rather than randomly.</p>
<ol start="2">
<li><strong>Replicating Skewed Data in</strong> <code>df2</code></li>
</ol>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> concat, col
<span class="hljs-keyword">from</span> pyspark.sql.functions <span class="hljs-keyword">import</span> lit

<span class="hljs-comment"># Create a DataFrame with Salt Values</span>
salt_values = spark.range(<span class="hljs-number">5</span>).withColumnRenamed(<span class="hljs-string">"id"</span>, <span class="hljs-string">"salt"</span>)

<span class="hljs-comment"># Cross-Join df2 with the Salt Values DataFrame</span>
df1_cross_joined = df1.crossJoin(salt_values)
df2_cross_joined = df2.crossJoin(salt_values)

<span class="hljs-comment"># Modify the Join Key in Both DataFrames to Include the Salt</span>
df1_cross_joined = df1_cross_joined.withColumn(<span class="hljs-string">"saltedJoinKey"</span>, concat(col(<span class="hljs-string">"joinKey"</span>), lit(<span class="hljs-string">"_"</span>), col(<span class="hljs-string">"salt"</span>)))
df2_cross_joined = df2_cross_joined.withColumn(<span class="hljs-string">"saltedJoinKey"</span>, concat(col(<span class="hljs-string">"joinKey"</span>), lit(<span class="hljs-string">"_"</span>), col(<span class="hljs-string">"salt"</span>)))
</code></pre>
<p>This operation is intended to replicate each row of <code>df2</code> five times, with each replication having a different salt value from 0 to 4. The <code>flatMap</code> function is used to expand each original row into multiple rows, one for each salt value. This approach ensures that when joining <code>df1</code> and <code>df2</code>, the join operation is not bottlenecked by any single value of <code>joinKey</code> since the data corresponding to each <code>joinKey</code> in <code>df2</code> is now evenly distributed across all salt values.</p>
<ol start="3">
<li><strong>Performing the Join with the Salted Column</strong></li>
</ol>
<pre><code class="lang-python">joined_skewed_df = df1_cross_joined.join(df2_cross_joined, df1_cross_joined.saltedJoinKey == df2_cross_joined.saltedJoinKey)
</code></pre>
<p>The join condition now matches rows on the salted key, which includes the <code>salt</code> value, effectively distributing the join operation more evenly across the cluster. But make a note to always test and monitor the performance impact when using this technique: the cross join replicates <code>df2</code> and can significantly increase the volume of data being processed, so salting should be used judiciously, especially with large datasets.</p>
<p>Implementing these advanced join techniques requires a good understanding of the data and the specific computational challenges it presents. Spark attempts to automatically choose the most efficient join strategy based on the data characteristics and the available cluster resources. However, for optimal performance, developers might need to manually adjust configurations, such as enabling or tuning specific join optimizations based on their understanding of the data and the nature of the join operation.</p>
<h1 id="heading-leveraging-shared-variables-in-apache-spark"><strong>Leveraging Shared Variables in Apache Spark</strong></h1>
<p>Apache Spark provides two types of shared variables to support different use cases when tasks across multiple nodes need to share data: broadcast variables and accumulators. Let's explore each type in detail.</p>
<h2 id="heading-introduction-to-shared-variables">Introduction to Shared Variables</h2>
<p>In distributed computing, operations are executed across many nodes, and sometimes these operations need to share some state or data among them. However, Spark's default behavior is to execute tasks on cluster nodes in an isolated manner. Shared variables come into play to allow these tasks to share data efficiently.</p>
<p><strong>Use Cases</strong>: Shared variables are used when you need to:</p>
<ul>
<li><p>Share a read-only variable with tasks across multiple nodes efficiently.</p>
</li>
<li><p>Accumulate results from tasks across multiple nodes in a central location.</p>
</li>
</ul>
<h2 id="heading-broadcast-variables">Broadcast Variables</h2>
<p>Broadcast variables allow the program to efficiently send a large, read-only variable to all worker nodes for use in one or more Spark operations.</p>
<h3 id="heading-advantages">Advantages</h3>
<ul>
<li><p><strong>Efficiency</strong>: Significantly reduces the amount of data that needs to be shipped to each node.</p>
</li>
<li><p><strong>Performance</strong>: Can cache the data in memory on each node, avoiding the need to re-send the data for each action.</p>
</li>
</ul>
<h3 id="heading-example-use-case"><strong>Example Use Case</strong></h3>
<p>Suppose you have a large lookup table that tasks across multiple nodes need to reference. Instead of sending this table with every task, you can broadcast it so that each node has a local copy.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext
sc = SparkContext()

lookup_table = {<span class="hljs-string">"id1"</span>: <span class="hljs-string">"value1"</span>, <span class="hljs-string">"id2"</span>: <span class="hljs-string">"value2"</span>}  <span class="hljs-comment"># Large lookup table</span>
broadcastVar = sc.broadcast(lookup_table)

data = sc.parallelize([<span class="hljs-string">"id1"</span>, <span class="hljs-string">"id2"</span>, <span class="hljs-string">"id1"</span>])
result = data.map(<span class="hljs-keyword">lambda</span> x: (x, broadcastVar.value.get(x))).collect()
<span class="hljs-comment"># Output: [("id1", "value1"), ("id2", "value2"), ("id1", "value1")]</span>
</code></pre>
<h2 id="heading-accumulators">Accumulators</h2>
<p>Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement <strong>counters</strong> or sums.</p>
<h3 id="heading-advantages-1">Advantages</h3>
<ul>
<li><p><strong>Central Aggregation</strong>: Provide a simple syntax for aggregating values across multiple tasks.</p>
</li>
<li><p><strong>Fault Tolerance</strong>: For updates performed inside actions, Spark applies each task's contribution to the accumulator exactly once, even if the task is re-executed; updates made inside transformations may be re-applied on retries, so counters there should be treated as approximate.</p>
</li>
</ul>
<h3 id="heading-example-use-case-1"><strong>Example Use Case</strong></h3>
<p>You want to count</p>
<ul>
<li><p>how many times each key appears across all partitions of a dataset, or</p>
</li>
<li><p>debug by counting how many records were processed or</p>
</li>
<li><p>how many had missing information.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext
sc = SparkContext()

data = sc.parallelize([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>])
counter = sc.accumulator(<span class="hljs-number">0</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">count_function</span>(<span class="hljs-params">x</span>):</span>
    <span class="hljs-keyword">global</span> counter
    counter += x

data.foreach(count_function)
print(counter.value)  
<span class="hljs-comment"># Output will be the sum of the numbers in the dataset: 15</span>
</code></pre>
<p>Both broadcast variables and accumulators provide essential capabilities for optimizing and managing state in distributed computations with Apache Spark, facilitating efficient data sharing and aggregation across the cluster.</p>
<h1 id="heading-datasets"><strong>Datasets</strong></h1>
<p>We've covered the various abstractions in PySpark, including DataFrames and RDDs. But what about those not native to PySpark?</p>
<p>Datasets are a strongly-typed collection of distributed data. They extend the functionality of DataFrames by adding compile-time type safety. This is particularly useful in Scala due to its static type system, where type information is known and checked at compile time.</p>
<p>Note - To simplify the comparison, imagine Python as similar to JavaScript, and Scala as akin to TypeScript, especially if you're familiar with application development (Only in terms of type strictness).</p>
<h2 id="heading-key-features">Key Features</h2>
<ul>
<li><p><strong>Type Safety</strong>: Datasets provide compile-time type safety. This means errors like accessing a non-existent column name or performing operations with incompatible data types can be caught during compilation rather than at runtime.</p>
</li>
<li><p><strong>Lambda Functions</strong>: Similar to RDDs, Datasets allow you to manipulate data using functional programming constructs. However, unlike RDDs, Datasets benefit from Spark SQL's optimization strategies (Catalyst optimizer).</p>
</li>
<li><p><strong>Interoperability with DataFrames</strong>: A Dataset can be considered a typed DataFrame. In fact, DataFrame is just a type alias for <code>Dataset[Row]</code>, where <code>Row</code> is a generic untyped JVM object. This allows seamless conversion and operation between typed Datasets and untyped DataFrames.</p>
</li>
</ul>
<h2 id="heading-example-usage-in-scala">Example Usage in Scala</h2>
<p>Don't worry too much about Scala or the syntax. Just focus on the comments to understand the code's purpose.</p>
<pre><code class="lang-scala"><span class="hljs-keyword">import</span> spark.implicits._

<span class="hljs-comment">// Define a case class that represents the schema of your data</span>
<span class="hljs-keyword">case</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Person</span>(<span class="hljs-params">name: <span class="hljs-type">String</span>, age: <span class="hljs-type">Long</span></span>)</span>

<span class="hljs-comment">// Creating a Dataset from a Seq collection</span>
<span class="hljs-keyword">val</span> peopleDS = <span class="hljs-type">Seq</span>(<span class="hljs-type">Person</span>(<span class="hljs-string">"Alice"</span>, <span class="hljs-number">30</span>), <span class="hljs-type">Person</span>(<span class="hljs-string">"Bob"</span>, <span class="hljs-number">25</span>)).toDS()

<span class="hljs-comment">// Perform transformations using lambda functions</span>
<span class="hljs-keyword">val</span> adultsDS = peopleDS.filter(_.age &gt;= <span class="hljs-number">18</span>)

<span class="hljs-comment">// Show results</span>
adultsDS.show()

<span class="hljs-comment">// Is 25 really adulting though?</span>
</code></pre>
<h1 id="heading-encoders"><strong>Encoders</strong></h1>
<p>Encoders are what Spark uses under the hood to convert between Java Virtual Machine (JVM) objects and Spark SQL's internal tabular format. This conversion is necessary for Spark to perform operations on data within Datasets efficiently. I'll talk more on JVM in the future articles.</p>
<h2 id="heading-key-features-1">Key Features</h2>
<ul>
<li><p><strong>Efficiency</strong>: Encoders are responsible for Spark's ability to handle serialization and deserialization cost-effectively. They allow Spark to compress data into a binary format and operate directly on this binary representation, reducing memory usage and improving processing speed.</p>
</li>
<li><p><strong>Custom Objects</strong>: For custom objects, like the <code>Person</code> case class in the example above, Spark requires an implicit Encoder to be available. Scala's implicits and the <code>Datasets</code> API make working with custom types seamless.</p>
</li>
</ul>
<h2 id="heading-example-usage-in-scala-1">Example Usage in Scala</h2>
<p>Implicitly, Encoders are used in the Dataset operations shown above. However, when creating Datasets from RDDs or <em>dealing with more complex types</em>, you might need to provide an encoder explicitly:</p>
<pre><code class="lang-scala"><span class="hljs-keyword">import</span> org.apache.spark.sql.<span class="hljs-type">Encoders</span>

<span class="hljs-comment">// Explicitly specifying an encoder for a Dataset of type String</span>
<span class="hljs-keyword">val</span> stringEncoder = <span class="hljs-type">Encoders</span>.<span class="hljs-type">STRING</span>

<span class="hljs-keyword">val</span> namesDS = spark.sparkContext
  .parallelize(<span class="hljs-type">Seq</span>(<span class="hljs-string">"Alice"</span>, <span class="hljs-string">"Bob"</span>))
  .toDS()(stringEncoder) <span class="hljs-comment">// Applying the encoder explicitly</span>

namesDS.show()
</code></pre>
<p>For a PySpark user, understanding Datasets and Encoders can offer insights into how Spark manages type safety, serialization, and optimization at a deeper level. While PySpark leverages dynamic typing and does not directly expose Datasets, the principles of efficient data representation, processing, and the ability to perform strongly-typed operations are universally beneficial across Spark's API ecosystem.</p>
]]></content:encoded></item><item><title><![CDATA[Dabbling with Spark Essentials Again]]></title><description><![CDATA[Apache Spark stands out as a pivotal tool for navigating the complexities of Big Data analysis. This article embarks on a comprehensive journey through the core of Spark, from unraveling the intricacies of Big Data and its foundational concepts to ma...]]></description><link>https://vaishnave.page/dabbling-with-spark-essentials-again</link><guid isPermaLink="true">https://vaishnave.page/dabbling-with-spark-essentials-again</guid><category><![CDATA[spark]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[Python]]></category><category><![CDATA[RDD (Resilient Distributed Dataset)]]></category><category><![CDATA[PySpark]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Thu, 21 Mar 2024 17:30:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1711037062784/91c5cd71-f937-4c65-aa9b-b2db31aaa3c0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Apache Spark stands out as a pivotal tool for navigating the complexities of Big Data analysis. This article embarks on a comprehensive journey through the core of Spark, from unraveling the intricacies of Big Data and its foundational concepts to mastering sophisticated data processing techniques with PySpark. Whether you're a beginner eager to dip your toes into the vast ocean of data analysis or a seasoned analyst aiming to refine your skills with Spark's advanced functionalities, this guide offers a structured pathway to enhance your proficiency in transforming raw data into insightful, actionable knowledge. I delve deep into essential concepts such as lazy evaluation, explain plan, and manipulation of Spark RDDs.</p>
<h1 id="heading-sparkcontext"><strong>SparkContext</strong></h1>
<p>The entry point to programming Spark with the RDD interface. It is responsible for managing and distributing data across the Spark cluster and creating RDDs. It can load data from various sources like HDFS (Hadoop Distributed File System), local file systems, and external databases. Through <code>SparkContext</code>, you can set various Spark properties and configurations that control aspects like memory usage, core utilization, and the behavior of Spark’s scheduler.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark <span class="hljs-keyword">import</span> SparkContext
sc = SparkContext(master=<span class="hljs-string">"local"</span>, appName=<span class="hljs-string">"My App"</span>)
</code></pre>
<h2 id="heading-info">Info</h2>
<p>Retrieves the SparkContext version, the Python version, and the master URL, respectively.</p>
<pre><code class="lang-python">print(sc.version, sc.pythonVer, sc.master)
</code></pre>
<h2 id="heading-sparkcontext-vs-sparksession"><strong>SparkContext vs SparkSession</strong></h2>
<p>While SparkContext is used for accessing Spark features through RDDs, SparkSession provides a single point of entry for DataFrames, Structured Streaming, and Hive functionality, combining what previously required separate contexts such as SQLContext and HiveContext. SparkSession was introduced in Spark 2.0 as the unified entry point for Spark applications.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession
spark = SparkSession.builder.appName(<span class="hljs-string">"My App"</span>).getOrCreate()
</code></pre>
<h1 id="heading-resilient-distributed-datasets-under-the-hood">Resilient Distributed Datasets <strong>- Under the Hood</strong></h1>
<iframe src="https://giphy.com/embed/xTiIzHJyaDjBJzWKnS" width="480" height="208" class="giphy-embed"></iframe>

<p>In the last article, I briefly explained about <a target="_blank" href="https://vaishnave.page/dabbling-with-spark-essentials#heading-directed-acyclic-graph-dag:~:text=as%20they%20track%20the%20lineage%20of%20transformations%20applied%20to%20them%2C%20allowing%20for%20re%2Dcomputation%20in%20case%20of%20node%20failures.%20(Analogous%20to%20Git%20version%20tracking)">the lineage of transformations that promotes the fault tolerant nature of Resilient Distributed Datasets</a>. Let's look at that in detail now:</p>
<h2 id="heading-lazy-evaluation"><strong>Lazy Evaluation</strong></h2>
<p>When you apply a transformation like <code>sort</code> to your data in Spark, it's crucial to understand that Spark operates under a principle known as <strong>Lazy Evaluation</strong>. This means that when you call a transformation, Spark doesn't immediately manipulate your data. Instead, what happens is that Spark queues up a series of transformations, building a plan for how it will eventually execute these operations across the cluster when necessary.</p>
<p>Lazy evaluation in Spark ensures that transformations like <code>sort</code> don't trigger any immediate action on the data. The transformation is registered as a part of the execution plan, but <em>Spark waits to execute any operations until an action</em> (such as <code>collect</code>, <code>count</code>, or <code>saveAsTextFile</code>) <em>is called</em>.</p>
<iframe src="https://giphy.com/embed/3o84sw9CmwYpAnRRni" width="480" height="208" class="giphy-embed"></iframe>

<p>This delay in execution allows Spark to map out a comprehensive strategy for distributing the data processing workload, enhancing the system's ability to manage resources and execute tasks in an optimized manner.</p>
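<p>A minimal sketch of this behaviour, assuming a small illustrative RDD of numbers: the <code>sortBy</code> call returns immediately because nothing is computed until an action such as <code>collect</code> is invoked.</p>
<pre><code class="lang-python">numbers = sc.parallelize([3, 1, 4, 1, 5])

# Transformation only - Spark records it in the execution plan, no data is touched yet
sorted_numbers = numbers.sortBy(lambda x: x)

# Action - this is the point where the queued transformations actually run
print(sorted_numbers.collect())  # [1, 1, 3, 4, 5]
</code></pre>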
<h2 id="heading-explain-plan"><strong>Explain Plan</strong></h2>
<p>To inspect the execution plan that Spark has prepared, including the sequence of transformations and how they will be applied, you can use the <code>explain</code> method on any DataFrame object. This method reveals the DataFrame's lineage, illustrating how Spark intends to execute the given query. This insight into the execution plan is particularly valuable for optimizing performance and understanding the sequence of operations Spark will undertake to process your data.</p>
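<p>For instance, assuming a DataFrame loaded from a hypothetical CSV file with <code>age</code> and <code>city</code> columns, the plan for a simple filter-and-aggregate query can be inspected like this:</p>
<pre><code class="lang-python">df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

# Build the query lazily, then print the plan Spark intends to execute
query = df.filter(df["age"] &gt; 30).groupBy("city").count()
query.explain()
</code></pre>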
<h2 id="heading-lineage-graph"><strong>Lineage Graph</strong></h2>
<p>Now, the question arises: what actually are these "lineages"?</p>
<p>RDD lineage is essentially a record of what operations were performed on the data from the time it was loaded into memory up to the present. It’s a directed acyclic graph (DAG) of the entire parent RDDs of an RDD. This graph consists of edges and nodes, where the <strong>nodes represent the RDDs</strong> and the <strong>edges represent the transformations</strong> that lead from one RDD to the next.</p>
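<p>You can peek at this lineage for any RDD with <code>toDebugString()</code>. A small sketch, assuming a couple of chained transformations on a toy dataset:</p>
<pre><code class="lang-python">rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Prints the chain of parent RDDs that would be recomputed if a partition were lost
print(evens.toDebugString())
</code></pre>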
<h1 id="heading-creating-and-using-rdds">Creating and Using RDDs</h1>
<h2 id="heading-parallelized-collection"><strong>Parallelized collection</strong></h2>
<p>Distributes a local Python collection to form an RDD. Useful for parallelizing existing Python collections into Spark RDDs.</p>
<pre><code class="lang-python">parallelized_data = sc.parallelize([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>])
</code></pre>
<h2 id="heading-from-external-tables"><strong>From external tables</strong></h2>
<p>Used for creating distributed collections of unstructured data from a text file.</p>
<pre><code class="lang-python">rdd_from_text_file = sc.textFile(<span class="hljs-string">"path/to/textfile"</span>)
</code></pre>
<h2 id="heading-partitioning"><strong>Partitioning</strong></h2>
<p>Partitioning in RDDs refers to the division of the dataset into smaller, logical portions that can be distributed across multiple nodes in a Spark cluster. This concept is fundamental to distributed computing in Spark, enabling parallel processing, which significantly enhances performance for large datasets. More on this topic will be covered in an upcoming article.</p>
<pre><code class="lang-python">rdd_with_partitions = rdd_from_file.repartition(<span class="hljs-number">10</span>)
print(rdd_with_partitions.getNumPartitions())
</code></pre>
<p>Alters the number of partitions in an RDD to distribute data more evenly across the cluster.</p>
<h2 id="heading-converting-an-rdd-to-a-dataframe"><strong>Converting an RDD to a DataFrame</strong></h2>
<p>The conversion of an RDD to a DataFrame in PySpark can be accomplished through multiple methods, involving the definition of the schema in different ways.</p>
<ul>
<li><code>StructType</code></li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql.types <span class="hljs-keyword">import</span> StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField(<span class="hljs-string">"name"</span>, StringType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"age"</span>, IntegerType(), <span class="hljs-literal">True</span>),
    StructField(<span class="hljs-string">"city"</span>, StringType(), <span class="hljs-literal">True</span>)
])
rdd = sc.parallelize([(<span class="hljs-string">"Ankit"</span>, <span class="hljs-number">30</span>, <span class="hljs-string">"Bangalore"</span>), (<span class="hljs-string">"Anitha"</span>, <span class="hljs-number">25</span>, <span class="hljs-string">"Chennai"</span>), (<span class="hljs-string">"Aditya"</span>, <span class="hljs-number">35</span>, <span class="hljs-string">"Mumbai"</span>)])
df = spark.createDataFrame(rdd, schema)
</code></pre>
<ul>
<li><code>pyspark.sql.Row</code></li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> Row
rdd = sc.parallelize([Row(name=<span class="hljs-string">"Ankit"</span>, age=<span class="hljs-number">30</span>, city=<span class="hljs-string">"Bangalore"</span>), Row(name=<span class="hljs-string">"Anitha"</span>, age=<span class="hljs-number">25</span>, city=<span class="hljs-string">"Chennai"</span>), Row(name=<span class="hljs-string">"Aditya"</span>, age=<span class="hljs-number">35</span>, city=<span class="hljs-string">"Mumbai"</span>)])
df = spark.createDataFrame(rdd)
</code></pre>
<h2 id="heading-converting-csv-to-dataframe"><strong>Converting CSV to DataFrame</strong></h2>
<pre><code class="lang-python">df_from_csv = spark.read.csv(<span class="hljs-string">"path/to/file.csv"</span>, header=<span class="hljs-literal">True</span>, inferSchema=<span class="hljs-literal">True</span>)
</code></pre>
<p>Directly reads a CSV file into a DataFrame, inferring the schema.</p>
<h2 id="heading-inspecting-data-in-dataframe"><strong>Inspecting Data in DataFrame</strong></h2>
<pre><code class="lang-python">df_from_csv.printSchema()
df_from_csv.describe().show()
</code></pre>
<p>Displays the schema of the DataFrame and shows summary statistics.</p>
<h2 id="heading-dataframe-visualization"><strong>DataFrame Visualization</strong></h2>
<pre><code class="lang-python">pandas_df = df_from_csv.toPandas()
</code></pre>
<p>Converts a Spark DataFrame to a Pandas DataFrame for visualization.</p>
<h2 id="heading-sql-queries"><strong>SQL queries</strong></h2>
<pre><code class="lang-python">df_from_csv.createOrReplaceTempView(<span class="hljs-string">"people"</span>)
spark.sql(<span class="hljs-string">"SELECT * FROM people WHERE age &gt; 30"</span>).show()
</code></pre>
<p><a target="_blank" href="https://vaishnave.page/dabbling-with-spark-essentials#heading-directed-acyclic-graph-dag:~:text=createOrReplaceTempView()%3A%20This,in%20Spark%20SQL.">Executes SQL queries directly on DataFrames that have been registered as a temporary view.</a></p>
<h1 id="heading-types-of-rdds">Types of RDDs</h1>
<p>RDDs are categorized based on their characteristics, which can include their partitioning strategy or the data types they contain. Understanding these categories is essential for optimizing the performance of Spark applications, as <em>each type has its implications for how data is processed and distributed across the cluster</em>. Here's a deeper look into these categorizations:</p>
<h2 id="heading-based-on-partitioning"><strong>Based on Partitioning</strong></h2>
<h3 id="heading-parallelcollection-rdds">ParallelCollection RDDs</h3>
<p>These RDDs are created from existing Scala or Python collections. When you parallelize a collection to create an RDD using the <code>parallelize()</code> method, Spark <em>distributes the elements of the collection across the cluster</em>. The distribution is determined by the number of partitions you specify or by the default parallelism level of the cluster. ParallelCollection RDDs are useful for parallelizing small datasets or for testing and prototyping with Spark.</p>
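<p>As a quick sketch, the partition count can be passed explicitly to <code>parallelize()</code> or left to the cluster's default parallelism:</p>
<pre><code class="lang-python">data = list(range(100))

default_rdd = sc.parallelize(data)
explicit_rdd = sc.parallelize(data, 4)

print(default_rdd.getNumPartitions())   # driven by the cluster's default parallelism
print(explicit_rdd.getNumPartitions())  # 4
</code></pre>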
<h3 id="heading-mappartitions-rdds">MapPartitions RDDs</h3>
<p>MapPartitions RDDs result from transformations that apply a function to <em>every partition of the parent RDD, rather than to individual elements</em>. Transformations like <code>mapPartitions()</code> and <code>mapPartitionsWithIndex()</code> create MapPartitions RDDs. This approach can be more efficient than applying a function to each element, especially when initialising an expensive operation (like opening a database connection) that is better done once per partition rather than once per element.</p>
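<p>A brief sketch of the per-partition style, with a stand-in for an expensive setup step (the real win comes when that step is something like opening a database connection):</p>
<pre><code class="lang-python">rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)

def process_partition(iterator):
    # Imagine an expensive setup here, done once per partition rather than once per element
    offset = 100
    for element in iterator:
        yield element + offset

print(rdd.mapPartitions(process_partition).collect())
# [101, 102, 103, 104, 105, 106]
</code></pre>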
<h3 id="heading-shuffled-rdds">Shuffled RDDs</h3>
<p>Shuffled RDDs are the result of transformations that cause data to be redistributed across the cluster, such as <code>repartition()</code> or by key-based transformations like <code>reduceByKey()</code>, <code>groupBy()</code>, and <code>join()</code>. These operations involve <em>shuffling data across the partitions to regroup it</em> according to the transformation's requirements. Shuffles are expensive operations in terms of network and disk I/O, so understanding when your operations result in Shuffled RDDs can help in optimizing your Spark jobs.</p>
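<p>A small sketch of a transformation that produces a shuffled RDD, using a toy Pair RDD: the key-based aggregation redistributes records so that equal keys end up on the same partition.</p>
<pre><code class="lang-python">pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], 3)

# reduceByKey shuffles data across partitions to group records by key
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('a', 2), ('b', 1)]
</code></pre>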
<h2 id="heading-based-on-data-types"><strong>Based on Data Types</strong></h2>
<p>RDDs can also be distinguished by the types of data they contain. This classification is more about the content of the RDD rather than its structural characteristics.</p>
<h3 id="heading-generic-rdds-like-python-rdd">Generic RDDs (like Python RDD)</h3>
<p>These are RDDs that contain any type of object, such as integers, strings, or custom objects. They are the most flexible type of RDD, as Spark does not impose any constraints on the data type. However, operations on these RDDs may not be as optimized as those on more specific types of RDDs.</p>
<h3 id="heading-key-value-pair-rdds-pair-rdds">Key-Value Pair RDDs (Pair RDDs)</h3>
<p>Pair RDDs contain tuples as elements, where each tuple consists of a key and a value. This structure is especially useful for operations that need to process data based on keys, such as aggregating or grouping data by key. I will also go into detail about this specific type of RDD later.</p>
<p>Understanding the different types of RDDs and their characteristics is crucial for writing efficient Spark applications. By choosing the appropriate type of RDD and transformations, developers can optimize data distribution and processing strategies, improving the performance and scalability of their Spark applications.</p>
<h1 id="heading-transformations-on-rdds">Transformations on RDDs</h1>
<p>Transformations create a new RDD from an existing one. Here are some key transformations:</p>
<h2 id="heading-map">map()</h2>
<p>Applies a function to each element of the RDD, returning a new RDD with the results.</p>
<pre><code class="lang-python">rdd = sc.parallelize([<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>, <span class="hljs-number">4</span>])
mapped_rdd = rdd.map(<span class="hljs-keyword">lambda</span> x: x * <span class="hljs-number">2</span>)
<span class="hljs-comment"># Resulting RDD: [2, 4, 6, 8]</span>
</code></pre>
<h2 id="heading-filter">filter()</h2>
<p>Returns a new RDD containing only the elements that satisfy a condition.</p>
<pre><code class="lang-python">filtered_rdd = rdd.filter(<span class="hljs-keyword">lambda</span> x: x % <span class="hljs-number">2</span> == <span class="hljs-number">0</span>)
<span class="hljs-comment"># Resulting RDD: [2, 4]</span>
</code></pre>
<p>This operation does not modify the original RDD but creates a new one with the filtered results.</p>
<h2 id="heading-flatmap">flatMap()</h2>
<p>Similar to <code>map</code>, but each input item can be mapped to 0 or more output items.</p>
<pre><code class="lang-python">sentences_rdd = sc.parallelize([<span class="hljs-string">"hello world"</span>, <span class="hljs-string">"how are you"</span>])
flattened_rdd = sentences_rdd.flatMap(<span class="hljs-keyword">lambda</span> x: x.split(<span class="hljs-string">" "</span>))
<span class="hljs-comment"># Resulting RDD: ["hello", "world", "how", "are", "you"]</span>
</code></pre>
<h1 id="heading-actions-on-rdds"><strong>Actions on RDDs</strong></h1>
<p>Actions trigger the execution of the transformations and return results.</p>
<h2 id="heading-collect">collect()</h2>
<p>Gathers the entire RDD's data to the driver program. Suitable for small datasets.</p>
<pre><code class="lang-python">all_elements = mapped_rdd.collect()
<span class="hljs-comment"># Result: [2, 4, 6, 8]</span>
</code></pre>
<h2 id="heading-taken">take(N)</h2>
<p>Returns an array with the first N elements of the RDD.</p>
<pre><code class="lang-python">first_two = mapped_rdd.take(<span class="hljs-number">2</span>)
<span class="hljs-comment"># Result: [2, 4]</span>
</code></pre>
<h2 id="heading-first">first()</h2>
<p>Retrieves the first element of the RDD.</p>
<pre><code class="lang-python">first_element = rdd.first()
<span class="hljs-comment"># Result: 1</span>
</code></pre>
<h2 id="heading-count">count()</h2>
<p>Counts the number of elements in the RDD.</p>
<pre><code class="lang-python">total = rdd.count()
<span class="hljs-comment"># Result: 4</span>
</code></pre>
<h2 id="heading-reduce">reduce()</h2>
<p>Aggregates the elements of the RDD using a function, bringing the result to the driver.</p>
<pre><code class="lang-python">sum = rdd.reduce(<span class="hljs-keyword">lambda</span> x, y: x + y)
<span class="hljs-comment"># Result: 10</span>
sum_mapped = mapped_rdd.reduce(<span class="hljs-keyword">lambda</span> x, y: x + y)
<span class="hljs-comment"># Result: 20</span>
product = rdd.reduce(<span class="hljs-keyword">lambda</span> x, y: x * y)
<span class="hljs-comment"># Result: 24</span>
</code></pre>
<h2 id="heading-saveastextfile">saveAsTextFile()</h2>
<p>Saves the RDD to the filesystem as a text file.</p>
<pre><code class="lang-python">rdd.saveAsTextFile(<span class="hljs-string">"path/to/output"</span>)
</code></pre>
<h2 id="heading-coalesce">coalesce()</h2>
<p>Reduces the number of partitions in the RDD, which is useful for consolidating data into fewer partitions before writing it out or collecting it on the driver (the data itself is unchanged). This method is particularly handy for operations that benefit from fewer partitions, such as reducing the number of files generated by actions like <code>saveAsTextFile()</code> or preparing data for export to a single file.</p>
<pre><code class="lang-python"><span class="hljs-comment"># This operation prepares the RDD for efficient saving or collecting </span>
<span class="hljs-comment"># but does not alter its content.</span>

coalesced_rdd = rdd.coalesce(<span class="hljs-number">1</span>)

<span class="hljs-comment">#Result - reduces the number of partitions in a dataset to 1</span>
</code></pre>
<p>Each of these transformations and actions plays a vital role in manipulating and retrieving data from RDDs in Spark, enabling efficient distributed data processing across a cluster.</p>
<h1 id="heading-key-value-pair-rdds-in-detail">Key-Value Pair RDDs in Detail</h1>
<p>Spark includes dedicated transformations and actions designed for use with Pair RDDs, and I'll cover these in the following discussion.</p>
<h2 id="heading-creating-pair-rdds">Creating Pair RDDs</h2>
<h3 id="heading-tuples"><strong>Tuples</strong></h3>
<p>You can create Pair RDDs by mapping an existing RDD to a format where each element is a tuple representing a key-value pair.</p>
<pre><code class="lang-python">rdd = sc.parallelize([<span class="hljs-string">'apple'</span>, <span class="hljs-string">'banana'</span>, <span class="hljs-string">'apple'</span>, <span class="hljs-string">'orange'</span>, <span class="hljs-string">'banana'</span>, <span class="hljs-string">'apple'</span>])
pair_rdd = rdd.map(<span class="hljs-keyword">lambda</span> fruit: (fruit, <span class="hljs-number">1</span>))
</code></pre>
<h3 id="heading-regular-rdds"><strong>Regular RDDs</strong></h3>
<p>Any regular RDD whose elements are key-value tuples can be treated as a Pair RDD; for example, parallelizing a collection of tuples directly yields a Pair RDD, demonstrating the flexibility in handling structured data.</p>
<pre><code class="lang-python">pair_rdd_from_regular = sc.parallelize([(<span class="hljs-string">"apple"</span>, <span class="hljs-number">1</span>), (<span class="hljs-string">"banana"</span>, <span class="hljs-number">2</span>), (<span class="hljs-string">"apple"</span>, <span class="hljs-number">3</span>)])
</code></pre>
<h2 id="heading-transformations">Transformations</h2>
<p>Transformations on Pair RDDs allow for sophisticated aggregation and sorting based on keys.</p>
<h3 id="heading-reducebykey"><strong>reduceByKey()</strong></h3>
<p>Aggregates values for each key using a specified reduce function.</p>
<pre><code class="lang-python">reduced_rdd = pair_rdd.reduceByKey(<span class="hljs-keyword">lambda</span> a, b: a + b)
<span class="hljs-comment"># Output example: [('banana', 2), ('apple', 3), ('orange', 1)]</span>
</code></pre>
<h3 id="heading-sortbykey"><strong>sortByKey()</strong></h3>
<p>Sorts the RDD by keys.</p>
<pre><code class="lang-python">sorted_rdd = reduced_rdd.sortByKey()
<span class="hljs-comment"># Output example: [('apple', 3), ('banana', 2), ('orange', 1)]</span>
</code></pre>
<h3 id="heading-groupbykey"><strong>groupByKey()</strong></h3>
<p>Groups values with the same key.</p>
<pre><code class="lang-python">grouped_rdd = pair_rdd.groupByKey().mapValues(list)
<span class="hljs-comment"># Output example: [('banana', [1, 1]), ('apple', [1, 1, 1]), ('orange', [1])]</span>
</code></pre>
<p>This operation groups all the values under each key into a list. It demonstrates the distribution of individual occurrences before aggregation.</p>
<h3 id="heading-join"><strong>join()</strong></h3>
<p>Joins two RDDs based on their keys.</p>
<pre><code class="lang-python">price_rdd = sc.parallelize([(<span class="hljs-string">"apple"</span>, <span class="hljs-number">0.5</span>), (<span class="hljs-string">"banana"</span>, <span class="hljs-number">0.75</span>), (<span class="hljs-string">"orange"</span>, <span class="hljs-number">0.8</span>)])
joined_rdd = reduced_rdd.join(price_rdd)
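<span class="hljs-comment"># Output example: [('apple', (3, 0.5)), ('banana', (2, 0.75)), ('orange', (1, 0.8))]</span>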
</code></pre>
<h2 id="heading-actions">Actions</h2>
<p>Actions on Pair RDDs trigger computation and return results to the driver program or store them in external storage.</p>
<h3 id="heading-countbykey"><strong>countByKey()</strong></h3>
<p>Counts the number of elements for each key.</p>
<pre><code class="lang-python">counts_by_key = pair_rdd.countByKey()
<span class="hljs-comment"># Output example: defaultdict(&lt;class 'int'&gt;, {'apple': 3, 'banana': 2, 'orange': 1})</span>
</code></pre>
<p>This result is a Python dictionary-like object (specifically a <code>defaultdict</code> of type <code>&lt;class 'int'&gt;</code>) that shows how many times each fruit appears in the original dataset.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Notice how we got a very similar result in <code>reduceByKey()</code> on using that specific lambda function. Both operations seem to aggregate data based on keys. But, what is the difference apart from the <code>defaultdict</code> type result?</div>
</div>

<p><code>reduceByKey()</code>: Performs a shuffle across the nodes to aggregate values by key. It's more flexible, allowing for complex aggregation functions beyond simple counting. It's suitable when your RDD contains complex data manipulations or when the dataset is not initially in the form of <code>(key, 1)</code> pairs.</p>
<p><code>countByKey()</code>: Provides a more direct approach for counting keys when the RDD is already structured as <code>(key, 1)</code> pairs, making it efficient for simple counting tasks. However, it collects the results to the driver, which could be problematic for very large datasets.</p>
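<p>A quick sketch of the distinction, using a Pair RDD whose values are not simple 1s: <code>reduceByKey()</code> can aggregate arbitrary values, while <code>countByKey()</code> only counts how many records exist per key.</p>
<pre><code class="lang-python">sales = sc.parallelize([("apple", 4), ("banana", 2), ("apple", 3)])

totals = sales.reduceByKey(lambda a, b: a + b).collect()
# [('apple', 7), ('banana', 2)]

occurrences = sales.countByKey()
# defaultdict(&lt;class 'int'&gt;, {'apple': 2, 'banana': 1})
</code></pre>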
<p>Now you understand which option to use and when.</p>
<h3 id="heading-collectasmap"><strong>collectAsMap()</strong></h3>
<p>Collects the result as a dictionary to bring back to the driver program.</p>
<pre><code class="lang-python">result_map = reduced_rdd.collectAsMap()
<span class="hljs-comment"># Output example: {'banana': 2, 'apple': 3, 'orange': 1}</span>
</code></pre>
<p>These operations on Pair RDDs illustrate Spark's powerful capabilities for distributed data processing, especially when working with data that naturally fits into key-value pairs.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>In conclusion, this article has delved deeply into the foundational aspects of Apache Spark, highlighting the pivotal role of SparkContext as the gateway to leveraging Spark's powerful capabilities. Additionally, we've explored the intricacies of Resilient Distributed Datasets (RDDs), uncovering how they serve as the backbone for distributed data processing within Spark. As we move forward, anticipate further exploration into the realms of Spark optimization and memory management. I'm excited to have you join me on this learning journey!</p>
]]></content:encoded></item><item><title><![CDATA[Dabbling with Spark Essentials]]></title><description><![CDATA[Embarking on the journey of understanding Apache Spark marks the beginning of an exciting series designed for both newcomers and myself, as we navigate the complexities of big data processing together. Apache Spark, with its unparalleled capabilities...]]></description><link>https://vaishnave.page/dabbling-with-spark-essentials</link><guid isPermaLink="true">https://vaishnave.page/dabbling-with-spark-essentials</guid><category><![CDATA[spark]]></category><category><![CDATA[apache]]></category><category><![CDATA[#apache-spark]]></category><category><![CDATA[sparksql]]></category><category><![CDATA[PySpark]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Sat, 16 Mar 2024 16:31:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1710610707492/cf574046-1ef4-4f0b-ad2d-ab445af5f04a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Embarking on the journey of understanding Apache Spark marks the beginning of an exciting series designed for both newcomers and myself, as we navigate the complexities of big data processing together. Apache Spark, with its unparalleled capabilities in analytics and data processing, stands as a pivotal technology in the realms of data science and engineering. This series kicks off with "Dabbling into the Essentials: A Beginner's Guide to Using Apache Spark," an article penned from the perspective of someone who is not just a guide but also a fellow traveler on this learning journey. My aim is to simplify the intricate world of Spark, making its core concepts, architecture, and operational fundamentals accessible to all.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710601256964/8e01085e-06c5-47ff-8c49-75e2e7dced40.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-core-concepts">Core Concepts</h2>
<p>At its heart, Spark is designed to efficiently process large volumes of data, spreading its tasks across many computers for rapid computation. It achieves this through several core concepts:</p>
<ul>
<li><h3 id="heading-resilient-distributed-datasets-rdds"><strong>Resilient Distributed Datasets (RDDs)</strong></h3>
<p>  Resilient Distributed Datasets (RDDs) are a fundamental concept in Spark. They are immutable distributed collections of objects. RDDs can be created through deterministic operations on either data in storage or other RDDs. They are fault-tolerant, as they track the lineage of transformations applied to them, allowing for re-computation in case of node failures. (Analogous to Git version tracking)</p>
</li>
<li><h3 id="heading-directed-acyclic-graph-dag"><strong>Directed Acyclic Graph (DAG)</strong></h3>
<p>  Spark uses DAG to optimize workflows. Unlike traditional MapReduce, which executes tasks in a linear sequence, Spark processes tasks in a graph structure, allowing for more efficient execution of complex computations.</p>
</li>
<li><h3 id="heading-dataframe-and-dataset"><strong>DataFrame and Dataset</strong></h3>
<p>  Introduced in later versions, these abstractions are similar to RDDs but offer richer optimizations. They allow you to work with structured data more naturally and leverage Spark SQL for querying.</p>
</li>
</ul>
<h2 id="heading-key-features">Key Features</h2>
<p>Apache Spark is known for its remarkable speed, versatile nature, and ease of use. Here's why:</p>
<ul>
<li><h3 id="heading-speed"><strong>Speed</strong></h3>
<p>  Spark's in-memory data processing is vastly faster than traditional disk-based processing for certain types of applications. It can run up to 100x faster than Hadoop MapReduce for in-memory processing and 10x faster for disk-based processing.</p>
</li>
<li><h3 id="heading-ease-of-use"><strong>Ease of Use</strong></h3>
<p>  With high-level APIs in Java, Scala, Python, and R, Spark allows developers to quickly write applications. Spark's concise and expressive syntax significantly reduces the amount of code required.</p>
</li>
<li><h3 id="heading-advanced-analytics"><strong>Advanced Analytics</strong></h3>
<p>  Beyond simple map and reduce operations, Spark supports SQL queries, streaming data, machine learning (ML), and graph processing. These capabilities make it a comprehensive platform for complex analytics on big data.</p>
</li>
</ul>
<h2 id="heading-architecture">Architecture</h2>
<p>Apache Spark runs on a cluster, the basic structure of which includes a driver and worker nodes. The driver program runs the SparkContext, which orchestrates the activities of the cluster, while worker nodes execute the tasks assigned to them. This design allows Spark to process data in parallel, significantly speeding up tasks.</p>
<h2 id="heading-deployment">Deployment</h2>
<p>Spark can run standalone, on EC2, on Hadoop YARN, on Mesos, or in Kubernetes. This flexibility means it can be integrated into existing infrastructure with minimal hassle.</p>
<h2 id="heading-use-cases">Use Cases</h2>
<ul>
<li><p><strong>Real-Time Stream Processing:</strong> Spark's streaming capabilities allow for processing live data streams from sources like Kafka and Flume.</p>
</li>
<li><p><strong>Machine Learning:</strong> Spark MLlib provides a suite of machine learning algorithms for classification, regression, clustering, and more.</p>
</li>
<li><p><strong>Interactive SQL Queries:</strong> Spark SQL lets you run SQL/HQL queries on big data.</p>
</li>
</ul>
<h2 id="heading-getting-started">Getting Started</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1710601411579/11c43133-1806-4618-958e-72490a733920.png" alt class="image--center mx-auto" /></p>
<p>The map above lists the functions and methods used to perform basic operations in PySpark. The code for each is listed under the headings below.</p>
<h3 id="heading-understanding-dataframes-and-catalogs">Understanding DataFrames and Catalogs</h3>
<p>The map above demonstrates the lifecycle and scope of data structures and components in a typical Spark application, focusing on how the user's code creates and interacts with DataFrames and how those relate to the Spark cluster and the Catalog within a session.</p>
<p>The arrows show the relationships or interactions between these components:</p>
<ul>
<li><p>The user creates and interacts with DataFrames locally in their Spark application through SparkContext.</p>
</li>
<li><p>The user can register DataFrames as temporary views in the catalog, making it possible to query them using SQL syntax.</p>
</li>
<li><p>The user can also directly query tables registered in the catalog using the <code>.table()</code> method through the SparkSession.</p>
</li>
<li><p>While the Catalog is accessed via SparkSession, the actual DataFrames are managed within the user's application and they interact with the cluster to distribute the computation.</p>
</li>
</ul>
<p>Spark DataFrames are immutable, meaning once created, they cannot be altered. Any transformation creates a new DataFrame.</p>
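<p>A tiny sketch of this immutability, assuming a DataFrame <code>df</code> with a numeric column <code>some_column</code>: the filter does not modify <code>df</code>; it returns a brand-new DataFrame derived from it.</p>
<pre><code class="lang-python">filtered = df.filter(df["some_column"] &gt; 10)

# df still contains every row; filtered is a separate DataFrame built on top of it
print(df.count(), filtered.count())
</code></pre>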
<h3 id="heading-sparkcontext-and-sparksession">SparkContext and SparkSession</h3>
<p>The <code>SparkContext</code> is the entry point to any Spark application, providing a connection to the Spark cluster, while the <code>SparkSession</code> offers an interface to that connection, enabling higher-level operations and easier interaction with Spark functionalities.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> SparkSession

<span class="hljs-comment"># Creating SparkSession (which also provides SparkContext)</span>
spark = SparkSession.builder \
    .appName(<span class="hljs-string">"Spark Example"</span>) \
    .getOrCreate()

<span class="hljs-comment"># SparkContext example, accessing through SparkSession</span>
sc = spark.sparkContext
</code></pre>
<h3 id="heading-sparksessionbuildergetorcreate-method">SparkSession.builder.getOrCreate() method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># This returns an existing SparkSession if there's already one in </span>
<span class="hljs-comment"># the environment, or creates a new one if necessary.</span>
spark = SparkSession.builder \
    .appName(<span class="hljs-string">"Spark Example"</span>) \
    .getOrCreate()
</code></pre>
<h3 id="heading-catalog">Catalog</h3>
<p>This is a metadata repository within Spark SQL. It keeps track of all the data and metadata (like databases, tables, functions, etc.) in a structured format. The catalog is available via the SparkSession, which is the unified entry point for reading, writing, and configuring data in Spark.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Lists all the tables inside the cluster</span>
tables_list = spark.catalog.listTables()
print(tables_list)
print([table.name <span class="hljs-keyword">for</span> table <span class="hljs-keyword">in</span> tables_list])
</code></pre>
<h3 id="heading-sparksessionsql-and-show-methods">SparkSession.sql() and .show() methods</h3>
<pre><code class="lang-python"><span class="hljs-comment"># This method takes a string containing the query and returns a </span>
<span class="hljs-comment"># DataFrame with the results</span>
result_df = spark.sql(<span class="hljs-string">"SELECT * FROM tableName WHERE someColumn &gt; 10"</span>)
result_df.show()
</code></pre>
<h3 id="heading-topandas-method">.toPandas() method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Convert Spark DataFrame to pandas DataFrame</span>
pandas_df = result_df.toPandas()
</code></pre>
<h3 id="heading-createdataframe-and-createorreplacetempview">createDataFrame() and .createOrReplaceTempView()</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment"># Assuming pandas_df is a pandas DataFrame</span>
spark_df = spark.createDataFrame(pandas_df)

<span class="hljs-comment"># The DataFrame is now in Spark but not in the SparkSession catalog </span>
spark_df.createOrReplaceTempView(<span class="hljs-string">"temp_table"</span>)
</code></pre>
<p><strong>createOrReplaceTempView():</strong> This method is a part of the DataFrame API. When you call this method on a DataFrame, it creates (or replaces if it already exists) a temporary view in the Spark catalog. This view is temporary in that it will disappear if the session that created it terminates. The temporary view allows the data to be queried as a table in Spark SQL.</p>
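<p>Once registered, the view can be queried like any table for the lifetime of the session (a quick sketch, reusing the <code>temp_table</code> view created above):</p>
<pre><code class="lang-python">spark.sql("SELECT * FROM temp_table LIMIT 10").show()
</code></pre>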
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Looking back at the "Understanding DataFrames and Catalogs" section can provide more clarity on the role of <code>createOrReplaceTempView()</code>.</div>
</div>

<h3 id="heading-readcsv-method">.read.csv() method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Convert a CSV file to a Spark DataFrame</span>
df = spark.read.csv(<span class="hljs-string">"path/to/your/file.csv"</span>, header=<span class="hljs-literal">True</span>)
</code></pre>
<h3 id="heading-filter-method">filter method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Equivalent to WHERE clause in SQL</span>
filtered_df = df.filter(df[<span class="hljs-string">"some_column"</span>] &gt; <span class="hljs-number">10</span>)
</code></pre>
<h3 id="heading-select-method">select method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># SELECT clause</span>
selected_df = df.select(<span class="hljs-string">"column1"</span>, <span class="hljs-string">"column2"</span>)
</code></pre>
<h3 id="heading-selectexpr-method">selectExpr method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># SELECT which takes SQL expressions as input</span>
expr_df = df.selectExpr(<span class="hljs-string">"column1 as new_column_name"</span>, <span class="hljs-string">"abs(column2)"</span>)
</code></pre>
<h3 id="heading-groupby-and-aggregation-functions">groupBy() and aggregation functions</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> pyspark.sql <span class="hljs-keyword">import</span> functions <span class="hljs-keyword">as</span> F

grouped_df = df.groupBy(<span class="hljs-string">"someColumn"</span>).agg(
    F.min(<span class="hljs-string">"anotherColumn"</span>),
    F.max(<span class="hljs-string">"anotherColumn"</span>),
    F.count(<span class="hljs-string">"anotherColumn"</span>),
    F.stddev(<span class="hljs-string">"anotherColumn"</span>)
)
</code></pre>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">For more clarity, refer to the equivalent SQL code below:</div>
</div>

<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> some_column,
       <span class="hljs-keyword">MIN</span>(another_column) <span class="hljs-keyword">AS</span> min_another_column,
       <span class="hljs-keyword">MAX</span>(another_column) <span class="hljs-keyword">AS</span> max_another_column,
       <span class="hljs-keyword">COUNT</span>(another_column) <span class="hljs-keyword">AS</span> count_another_column,
       <span class="hljs-keyword">STDDEV</span>(another_column) <span class="hljs-keyword">AS</span> stddev_another_column
<span class="hljs-keyword">FROM</span> df
<span class="hljs-keyword">GROUP</span> <span class="hljs-keyword">BY</span> some_column
</code></pre>
<h3 id="heading-withcolumn-and-withcolumnrenamed-methods">.withColumn() and .withColumnRenamed() methods</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Adding a new column or replacing an existing one</span>
new_df = df.withColumn(<span class="hljs-string">"new_column"</span>, df[<span class="hljs-string">"some_column"</span>] * <span class="hljs-number">2</span>)

<span class="hljs-comment"># Renaming a column</span>
renamed_df = df.withColumnRenamed(<span class="hljs-string">"old_name"</span>, <span class="hljs-string">"new_name"</span>)
</code></pre>
<h3 id="heading-join-method">join method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Joining two tables</span>
joinedDF = table1.join(table2, <span class="hljs-string">"common_column"</span>, <span class="hljs-string">"leftouter"</span>)
</code></pre>
<h3 id="heading-cast-method">.cast() method</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Casting a column to a different type</span>
df = df.withColumn(<span class="hljs-string">"some_column"</span>, df[<span class="hljs-string">"some_column"</span>].cast(<span class="hljs-string">"integer"</span>))
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In wrapping up our exploration of the basics of Apache Spark, it's clear that this powerful framework stands out for its ability to process large volumes of data efficiently across distributed systems. Spark’s versatile ecosystem, which encompasses Spark SQL, DataFrames, and advanced analytics libraries, empowers data engineers and scientists to tackle a wide array of data processing and analysis tasks. Keep an eye out for future articles in this Spark series, where I'll delve deeper into Spark's architecture, execution models, and optimization strategies, providing you with a comprehensive understanding to leverage its full capabilities.</p>
]]></content:encoded></item><item><title><![CDATA[From 'My Way' to the Hive Way]]></title><description><![CDATA[In the vast world of big data processing, Apache Hive has emerged as a powerful tool for querying and analyzing large datasets stored in distributed storage systems like Hadoop. However, as the volume and complexity of data continue to grow, optimizi...]]></description><link>https://vaishnave.page/from-my-way-to-the-hive-way</link><guid isPermaLink="true">https://vaishnave.page/from-my-way-to-the-hive-way</guid><category><![CDATA[data]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[hive]]></category><category><![CDATA[hadoop]]></category><category><![CDATA[optimization]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Thu, 11 May 2023 13:25:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1683719801786/d82163f9-ec8a-41e7-bb23-8be9907654be.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the vast world of big data processing, Apache Hive has emerged as a powerful tool for querying and analyzing large datasets stored in distributed storage systems like Hadoop. However, as the volume and complexity of data continue to grow, optimizing Hive's performance becomes crucial. In this article, we will delve into the realm of hive optimization, exploring various strategies and techniques to boost query performance and maximize the efficiency of your data processing pipeline.</p>
<h1 id="heading-structure-level-optimization">Structure Level Optimization</h1>
<p>Optimizing Hive for maximum performance requires not only writing better queries but paying attention to schema and structure. By carefully designing data models, balancing normalization and denormalization, and considering factors like bucketing, partitioning, indexing, and compression, you can significantly improve the performance of your Hive environment. Let's dive into bucketing and partitioning.</p>
<h2 id="heading-partitioning">Partitioning</h2>
<p>Partitioning involves dividing a table into smaller, more manageable parts based on a particular column or set of columns. This can help with data organization, querying, and filtering. I will go over the two types of partitioning - Static and Dynamic Partitioning. I will also dive into a common issue with partitioning - Over Partitioning.</p>
<h3 id="heading-static-partitioning">Static Partitioning</h3>
<p>In static partitioning, the partition values are explicitly specified <strong>manually</strong> during the data insertion process. Static partitions require prior knowledge of the partition values, and separate INSERT statements are needed for each partition. It provides better control and predictability over partitioning as the partitions are pre-defined and known in advance. Static partitions are typically suitable for scenarios where the partitioning criteria are well-defined and stable, such as partitioning data by year or region.</p>
<h3 id="heading-dynamic-partitioning">Dynamic Partitioning</h3>
<p>Dynamic partitioning, also known as <strong>automatic partitioning</strong>, allows Hive to <strong>automatically determine the partitions</strong> based on the values present in the data being inserted. With dynamic partitions, a single INSERT statement can insert data into multiple partitions without explicitly specifying partition values. Dynamic partitions are flexible and can adapt to changes in data without requiring manual intervention. This type of partitioning is useful when the partitioning criteria are not known in advance or when new partitions need to be created frequently, such as partitioning data by date or customer ID.</p>
<h3 id="heading-dealing-with-over-partitioning">Dealing With Over Partitioning</h3>
<p>When working with Hive and HDFS, it's important to consider the drawbacks of having too many partitions. The creation of numerous Hadoop files and directories for each partition can quickly overwhelm the NameNode's capacity to manage the filesystem metadata. As the number of partitions grows, the strain on memory and the potential for exhausting metadata capacity increases.</p>
<p>Additionally, MapReduce processing introduces overhead in the form of JVM start-up and tear-down, which can be significant when dealing with small files.</p>
<p>When implementing time-range partitioning in Hive, it is recommended to estimate the data size across different time granularities. Begin with a granularity that ensures controlled partition growth over time, while ensuring that each partition's file size is at least equal to the filesystem block size or its multiples. This approach optimizes query performance for general queries. However, it's crucial to consider when the next level of granularity becomes appropriate, particularly if query WHERE clauses often select ranges of smaller granularities.</p>
<p>Another approach is to employ two levels of partitions based on different dimensions. For example, you can partition by day at the first level and by geographic region (e.g., state) at the second level. This allows for more targeted querying, such as retrieving data for a specific day within a particular state.</p>
<p>However, it is worth noting that processing imbalances may occur when handling larger states that have significantly more data, resulting in longer processing times compared to smaller states with lesser data volumes. This is where bucketing comes into the picture.</p>
<h2 id="heading-bucketing">Bucketing</h2>
<p>Bucketing involves grouping data based on a hash function applied to one or more columns. This can improve query performance by reducing the amount of data that needs to be scanned.</p>
<h2 id="heading-use-case">Use Case</h2>
<ul>
<li><p>Partitioning is a useful technique when the cardinality or number of unique values in a column is low. For example, if you have a table of sales data with a "year" column, partitioning the data based on this column can make it easier to filter and analyze sales for specific years.</p>
<p>  Bucketing, on the other hand, is ideal when the cardinality of a column is high.</p>
</li>
</ul>
<p>See the image below on how Hive stores partitions and buckets. There are two partitions based on product ID (P1 and P2) and 3 buckets (bucket 0, bucket 1 and bucket 2).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1683721308358/a0e4270b-0cfd-4a95-9c45-6e7de8962d8b.jpeg" alt class="image--center mx-auto" /></p>
<ul>
<li>Use the following command -</li>
</ul>
<p><code>SHOW PARTITIONS sales_table</code></p>
<p>to see all the partitions in your table.</p>
<ul>
<li>Use the following query to see the data in a given bucket</li>
</ul>
<p><code>SELECT * FROM sales_table TABLESAMPLE (bucket 1 out of 3);</code></p>
<p>Change the bucket number to 1, 2 and 3 to see the data in each bucket.</p>
<h1 id="heading-query-level-optimization">Query Level Optimization</h1>
<p>Join optimization techniques in Hive aim to improve the performance and efficiency of join operations, which are essential for querying and analyzing large datasets.</p>
<p>Join optimization techniques in Hive involve the use of hashtables and leveraging distributed processing to improve performance. The process generally follows these steps:</p>
<ol>
<li><p>A hashtable is created from the smaller table involved in the join operation.</p>
</li>
<li><p>The hashtable is moved to HDFS for storage and accessibility.</p>
</li>
<li><p>From HDFS, the hashtable is broadcasted to all nodes in the cluster, ensuring it is available locally.</p>
</li>
<li><p>The hashtable is then stored in the distributed cache, residing on the local disk of each node.</p>
</li>
<li><p>Each node loads the hashtable into memory from its local disk, allowing for efficient in-memory access.</p>
</li>
<li><p>Finally, the MapReduce job is invoked to perform the join operation, utilizing the loaded hashtables to optimize the process.</p>
</li>
</ol>
<p>Here is a summary of common join optimization techniques used in Hive:</p>
<h2 id="heading-map-side-join-optimization">Map Side Join Optimization</h2>
<p>In certain cases, if one of the tables being joined is small enough to fit in memory, Hive can perform a map-side join. This technique avoids costly shuffling and reduces the need for network transfers by distributing the smaller table to all mapper tasks.</p>
<h3 id="heading-settings">Settings</h3>
<p><code>SET hive.auto.convert.join=true; -- default true after v0.11.0</code></p>
<p><code>SET hive.mapjoin.smalltable.filesize=600000000; -- default 25m</code></p>
<p><code>SET hive.auto.convert.join.noconditionaltask=true; -- default value above is true so map join hint is not needed</code></p>
<p><code>SET hive.auto.convert.join.noconditionaltask.size=10000000; -- default value above controls the size of table to fit in memory</code></p>
<p>With join auto-convert enabled in Hive, the system will automatically assess if the file size of the smaller table exceeds the specified threshold defined by hive.mapjoin.smalltable.filesize. If the file size is smaller than the threshold, Hive will attempt to convert the common join into a map join. This automatic conversion eliminates the need to explicitly provide map join hints in the query, simplifying the query writing process and reducing manual optimization efforts. By leveraging this feature, users can optimize join operations without the need for manual intervention.</p>
<h3 id="heading-code">Code</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Perform a map join on two tables </span>
<span class="hljs-keyword">SELECT</span>  <span class="hljs-comment">/*+ MAPJOIN(table2) */</span> *
<span class="hljs-keyword">FROM</span> table1 
<span class="hljs-keyword">JOIN</span> table2 <span class="hljs-keyword">ON</span> table1.key = table2.key;
</code></pre>
<h3 id="heading-conditions">Conditions</h3>
<ul>
<li>Except for one big table, all other tables in the join operation should be small.</li>
</ul>
<p>Here is a quick guide for the different joins and which table should be the smaller one.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Type of join</td><td>Small table</td></tr>
</thead>
<tbody>
<tr>
<td>Inner Join</td><td>Left table must be small</td></tr>
<tr>
<td>Left Join</td><td>Right table must be small</td></tr>
<tr>
<td>Right Join</td><td>Left table must be small</td></tr>
<tr>
<td>Full Outer Join</td><td>Cannot be treated as a map side join</td></tr>
</tbody>
</table>
</div><h2 id="heading-bucket-map-join-optimization">Bucket Map Join Optimization</h2>
<p>When both tables are bucketed on the join key, Hive can leverage bucketed map joins. This technique allows data with the same bucket ID to be processed locally, minimizing data movement and reducing the need for a full shuffle.</p>
<p>It's important to note that unlike map side join,</p>
<ul>
<li><p>it can be done on two big tables.</p>
</li>
<li><p>Essentially, only one bucket at a time is loaded into memory.</p>
</li>
</ul>
<h3 id="heading-settings-1">Settings</h3>
<p><code>SET hive.auto.convert.join=true;</code></p>
<p><code>SET hive.optimize.bucketmapjoin=true; -- default false</code></p>
<h3 id="heading-code-1">Code</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Perform a bucketed map join on two tables</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-comment">/*+ MAPJOIN(table2) */</span> *
<span class="hljs-keyword">FROM</span> table1
<span class="hljs-keyword">JOIN</span> table2 <span class="hljs-keyword">ON</span> table1.key = table2.key
CLUSTERED <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">key</span>) <span class="hljs-keyword">INTO</span> &lt;num_of_buckets&gt;;
</code></pre>
<h3 id="heading-conditions-1">Conditions</h3>
<ul>
<li><p>Both tables should be bucketed on the join column.</p>
</li>
<li><p>The number of buckets in one table must be an integral multiple of the number of buckets in the other. E.g., if table 1 has 4 buckets, table 2 can have 4, 8, 12, etc. buckets.</p>
</li>
</ul>
<h2 id="heading-sort-merge-bucket-join-optimization">Sort Merge Bucket Join Optimization</h2>
<p>When both tables are bucketed and sorted on the join key, Hive can use sort merge bucketed joins. This technique avoids unnecessary data shuffling and performs an efficient merge operation on the sorted buckets, enhancing join performance. This join can be done on two big tables.</p>
<h3 id="heading-settings-2">Settings</h3>
<p><code>SET hive.input.format= org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;</code></p>
<p><code>SET hive.auto.convert.sortmerge.join=true;</code></p>
<p><code>SET hive.optimize.bucketmapjoin=true;</code></p>
<p><code>SET hive.optimize.bucketmapjoin.sortedmerge=true;</code></p>
<p><code>SET hive.auto.convert.sortmerge.join.noconditionaltask=true;</code></p>
<h3 id="heading-code-2">Code</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Perform a sort merge bucketed join on two tables</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-comment">/*+ SMBJOIN(table2) */</span> *
<span class="hljs-keyword">FROM</span> table1
<span class="hljs-keyword">JOIN</span> table2 <span class="hljs-keyword">ON</span> table1.key = table2.key
CLUSTERED <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">key</span>) SORTED <span class="hljs-keyword">BY</span> (<span class="hljs-keyword">key</span>) <span class="hljs-keyword">INTO</span> &lt;num_of_buckets&gt;;
</code></pre>
<h3 id="heading-conditions-2">Conditions</h3>
<ul>
<li><p>Both tables should be bucketed on the join column.</p>
</li>
<li><p>The number of buckets in one table must be equal to the number of buckets in the other table.</p>
</li>
<li><p>Both tables should be sorted based on the join column.</p>
</li>
</ul>
<h2 id="heading-sort-merge-bucket-map-join-optimization">Sort Merge Bucket Map Join Optimization</h2>
<p>An SMBM (Sort-Merge-Bucket Map) join is a specific type of bucket join that exclusively triggers a map-side join. Unlike a traditional map join, which requires caching all rows of the small table in memory, an SMBM join avoids this overhead.</p>
<h3 id="heading-settings-3">Settings</h3>
<p><code>SET hive.auto.convert.sortmerge.join=true</code></p>
<p><code>SET hive.optimize.bucketmapjoin=true;</code></p>
<p><code>SET hive.optimize.bucketmapjoin.sortedmerge=true;</code></p>
<p><code>SET hive.auto.convert.sortmerge.join.noconditionaltask=true;</code></p>
<p><code>SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;</code></p>
<h3 id="heading-conditions-3">Conditions</h3>
<ul>
<li><p>Both tables should be bucketed on the join column.</p>
</li>
<li><p>The number of buckets in one table must be equal to the number of buckets in the other table.</p>
</li>
<li><p>Both tables should be sorted based on the join column.</p>
</li>
</ul>
<h2 id="heading-skew-join-optimization">Skew Join Optimization</h2>
<p>When dealing with data that exhibits a significant imbalance in distribution, data skew can occur, leading to a situation where a few compute nodes bear the brunt of the computation workload.</p>
<h3 id="heading-settings-4">Settings</h3>
<p><code>SET hive.optimize.skewjoin=true; --If there is data skew in join, set it to true. Default is false.</code></p>
<p><code>SET hive.skewjoin.key=100000; --This is the default value. If a join key has more rows than this, it is treated as a skew key and its rows are sent to the other unused reducers.</code></p>
<p><code>SET hive.groupby.skewindata=true;</code></p>
<p>Data skew can also occur when performing GROUP BY operations, leading to uneven distribution and performance bottlenecks. To address this, Hive offers the setting above, enabling skew data optimization specifically for GROUP BY results. Once enabled, Hive initiates an additional MapReduce job that strategically redistributes the map output across reducers in a random manner, effectively mitigating data skew and improving overall query performance. By configuring this setting, Hive ensures a more balanced workload distribution and optimized processing for GROUP BY operations affected by skewed data.</p>
<h3 id="heading-code-3">Code</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Perform a join with skew join optimization</span>
<span class="hljs-keyword">SELECT</span> <span class="hljs-comment">/*+ SKEWJOIN(table1) */</span> *
<span class="hljs-keyword">FROM</span> table1
<span class="hljs-keyword">JOIN</span> table2 <span class="hljs-keyword">ON</span> table1.key = table2.key;
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Python Fundamentals in a Nutshell]]></title><description><![CDATA[After some time off from Python, I decided to revisit some old concepts as well as learn a few new ones. Thought of summarizing them here:

If you are performing the same calculation within a function for different input arguments, use nested functio...]]></description><link>https://vaishnave.page/python-fundamentals-in-a-nutshell</link><guid isPermaLink="true">https://vaishnave.page/python-fundamentals-in-a-nutshell</guid><category><![CDATA[Python]]></category><category><![CDATA[coding]]></category><category><![CDATA[best practices]]></category><dc:creator><![CDATA[Vaishnave Subbramanian]]></dc:creator><pubDate>Sun, 23 Apr 2023 16:28:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1682267038536/12224325-5266-4149-b9d6-f9f6816f2b93.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>After some time off from Python, I decided to revisit some old concepts as well as learn a few new ones. Thought of summarizing them here:</p>
<ul>
<li>If you are performing the same calculation within a function for different input arguments, use nested functions.</li>
</ul>
<p>This nested function can take one input argument and return the value after the calculation is performed on that argument. In the main (outer) function, return this nested function with the different input arguments passed to it.</p>
<p>So your initial function definition:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">square_three_numbers</span>(<span class="hljs-params">n1,n2,n3</span>):</span>
    sq_n1 = n1**<span class="hljs-number">2</span>
    sq_n2 = n2**<span class="hljs-number">2</span>
    sq_n3 = n3**<span class="hljs-number">2</span>
    <span class="hljs-keyword">return</span> (sq_n1,sq_n2,sq_n3)
</code></pre>
<p>Becomes:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">square_three_numbers</span>(<span class="hljs-params">n1,n2,n3</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">square_number</span>(<span class="hljs-params">number</span>):</span>
        <span class="hljs-keyword">return</span> (number**<span class="hljs-number">2</span>)
    <span class="hljs-keyword">return</span> (square_number(n1),square_number(n2),square_number(n3))
</code></pre>
<p>The second snippet is a much cleaner way of modularizing your code.</p>
<ul>
<li>Use flexible arguments - '*args' and '**kwargs' when you don't know how many arguments you will be passing.</li>
</ul>
<p>Use '*args' when you are passing plain positional arguments. They get collected into a tuple inside the function body, which you can then iterate over to get your arguments.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sum_numbers</span>(<span class="hljs-params">*args</span>):</span>
    total=<span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> number <span class="hljs-keyword">in</span> args:
        total=total+number
    <span class="hljs-keyword">return</span>(total)

<span class="hljs-comment">#Function call</span>
print(sum_numbers(<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>))

<span class="hljs-comment">#Output:</span>
<span class="hljs-comment">#6</span>
</code></pre>
<p>Use '**kwargs' when you pass each argument as an identifier=value pair (a keyword argument). These get collected into a dictionary inside the function body (with the identifiers as keys and the argument values as values), which you can then iterate over to get your arguments.</p>
<pre><code class="lang-python"><span class="hljs-comment">#Function definition</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">my_favourites</span>(<span class="hljs-params">**kwargs</span>):</span>
    <span class="hljs-keyword">for</span> key,value <span class="hljs-keyword">in</span> kwargs.items():
        print(<span class="hljs-string">"My favourite {0} is \'{1}\'!"</span>.format(key,value))



<span class="hljs-comment">#Function call</span>
my_favourites(series=<span class="hljs-string">"Handmaid's Tale"</span>, movie=<span class="hljs-string">"Rogue One: A Star Wars Story"</span>,song=<span class="hljs-string">"Less I Know The Better"</span>)


<span class="hljs-comment">#Output:</span>
<span class="hljs-comment">#My favourite series is 'Handmaid's Tale'!</span>
<span class="hljs-comment">#My favourite movie is 'Rogue One: A Star Wars Story'!</span>
<span class="hljs-comment">#My favourite song is 'Less I Know The Better'!</span>
</code></pre>
<p>Is it weird that I took more time to decide my favourite movie than to write this code?</p>
<p>Notice the format() method on the string. This is far easier than trying to concatenate strings using +.</p>
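<p>As a quick illustration (the variable names here are made up for the example):</p>
<pre><code class="lang-python">language = "Python"
version = 3

# Concatenation forces explicit str() conversion and gets noisy quickly
print("I love " + language + " " + str(version) + "!")

# format() keeps the template readable in one piece
print("I love {0} {1}!".format(language, version))

# Output (both lines):
# I love Python 3!
</code></pre>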
<ul>
<li>Lambda functions (anonymous functions) - usually used when you want to pass a short function as an argument to higher-order functions (like map, filter and reduce below), or to return one from another function.</li>
</ul>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">exponent_function</span>(<span class="hljs-params">exp</span>):</span>
    <span class="hljs-keyword">return</span> <span class="hljs-keyword">lambda</span> num : num ** exp

raised_to_10 = exponent_function(<span class="hljs-number">10</span>)
raised_to_5 = exponent_function(<span class="hljs-number">5</span>)

print(raised_to_10(<span class="hljs-number">2</span>))
print(raised_to_5(<span class="hljs-number">2</span>))

<span class="hljs-comment">#Output:</span>
<span class="hljs-comment">#1024</span>
<span class="hljs-comment">#32</span>
</code></pre>
<p>Lambda functions are used with:</p>
<blockquote>
<p>map(function,sequence) - Applies the lambda function to all the elements in the sequence. Returns a map object.</p>
</blockquote>
<pre><code class="lang-python">numbers = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>]


<span class="hljs-comment">#Use map() to apply the lambda function for all elements in your list</span>
squares = map(<span class="hljs-keyword">lambda</span> num:num**<span class="hljs-number">2</span>, numbers)


<span class="hljs-comment">#Remember, squares is a map object. So, convert it into a list to see what the values are!</span>
print(list(squares))

<span class="hljs-comment">#Output:</span>
<span class="hljs-comment">#[1,4,9,16]</span>
</code></pre>
<blockquote>
<p>filter(function,sequence) - Applies the lambda function as a condition to each element in the sequence and keeps only those for which it returns True. Returns a filter object.</p>
</blockquote>
<pre><code class="lang-python">numbers = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">7</span>]


<span class="hljs-comment">#Use the lambda function in filter to print even numbers from the list</span>
even_numbers = filter(<span class="hljs-keyword">lambda</span> nums:nums%<span class="hljs-number">2</span>==<span class="hljs-number">0</span>,numbers)


<span class="hljs-comment">#Remember, even_numbers is a filter object. So, convert it into a list to see what the values are!</span>
print(list(even_numbers))

<span class="hljs-comment">#Output</span>
<span class="hljs-comment">#[2, 4, 6]</span>
</code></pre>
<blockquote>
<p>reduce(function,sequence) - Applies the lambda function cumulatively to the elements in the sequence, reducing them to a single value. In Python 3 it lives in the functools module.</p>
</blockquote>
<pre><code class="lang-python">numbers = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>]


<span class="hljs-comment"># Use reduce() to apply a lambda function over numbers</span>
product = reduce(<span class="hljs-keyword">lambda</span> item1,item2:item1*item2, numbers)


<span class="hljs-comment"># Print the result</span>
print(product)

<span class="hljs-comment">#Output:</span>
<span class="hljs-comment">#24</span>
</code></pre>
<p>The function is first applied to the first two elements; that result and the third element are then passed into the function, and so on until the sequence is reduced to a single value.</p>
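<p>To make that folding order concrete, here is a small sketch (the variable names are just for illustration):</p>
<pre><code class="lang-python">from functools import reduce

numbers = [1, 2, 3, 4]

# reduce() folds from the left: ((1 + 2) + 3) + 4
total = reduce(lambda item1, item2: item1 + item2, numbers)
print(total)

# An optional third argument seeds the very first call: (((10 + 1) + 2) + 3) + 4
total_from_ten = reduce(lambda item1, item2: item1 + item2, numbers, 10)
print(total_from_ten)

# Output:
# 10
# 20
</code></pre>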
<ul>
<li>Iterators and Iterables - For efficient and flexible processing of data, especially when working with large datasets.</li>
</ul>
<p><strong>Iterables</strong></p>
<pre><code class="lang-python">fruits = [<span class="hljs-string">"apple"</span>, <span class="hljs-string">"banana"</span>, <span class="hljs-string">"orange"</span>]
<span class="hljs-keyword">for</span> fruit <span class="hljs-keyword">in</span> fruits:
    print(fruit)
</code></pre>
<p>Output:</p>
<pre><code class="lang-python">apple
banana
orange
</code></pre>
<p>In this example, the <code>fruits</code> list is an iterable object, and we can loop over its elements using a <code>for</code> loop.</p>
<p><strong>Iterators</strong></p>
<pre><code class="lang-python">fruits = [<span class="hljs-string">"apple"</span>, <span class="hljs-string">"banana"</span>, <span class="hljs-string">"orange"</span>]
fruit_iterator = iter(fruits)
print(next(fruit_iterator))
print(next(fruit_iterator))
print(next(fruit_iterator))
</code></pre>
<p>Output:</p>
<pre><code class="lang-python">apple
banana
orange
</code></pre>
<p>In this example, we create an iterator object <code>fruit_iterator</code> using the <code>iter()</code> function. We then use the <code>next()</code> function to iterate over the elements of the <code>fruits</code> list. Each call to <code>next()</code> returns the next element in the list until there are no more elements to iterate over.</p>
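<p>One detail worth remembering: once an iterator is exhausted, a further call to <code>next()</code> raises <code>StopIteration</code>, which is exactly how a <code>for</code> loop knows when to stop. A small sketch:</p>
<pre><code class="lang-python">fruits = ["apple", "banana", "orange"]
fruit_iterator = iter(fruits)

# A for loop quietly calls next() for us until the iterator is exhausted
for fruit in fruit_iterator:
    print(fruit)

# The iterator is now used up, so another next() raises StopIteration
try:
    next(fruit_iterator)
except StopIteration:
    print("No more fruits!")

# next() also accepts a default value, which avoids the exception entirely
print(next(fruit_iterator, "empty"))

# Output:
# apple
# banana
# orange
# No more fruits!
# empty
</code></pre>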
<ul>
<li>List Comprehensions - <strong>For loop reimagined.</strong></li>
</ul>
<p>When this code:</p>
<pre><code class="lang-python">my_list=[]
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">10</span>):
    <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(<span class="hljs-number">5</span>):
        my_list.append((j,i))
print(my_list)
</code></pre>
<p>becomes this:</p>
<pre><code class="lang-python">my_list_comp=[(j,i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">10</span>) <span class="hljs-keyword">for</span> j <span class="hljs-keyword">in</span> range(<span class="hljs-number">5</span>) ]
print(my_list_comp)
</code></pre>
<p>Your program will send you a pop-up saying 'Thank you!'</p>
<p>Seriously. Try it.</p>
<p>However, to make sure you stay <code>pythonic</code>, it helps to have a clear understanding of when to use list comprehensions and when not to.</p>
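<p>A rough illustration of that trade-off:</p>
<pre><code class="lang-python"># Readable: one loop, one simple condition
even_squares = [n**2 for n in range(10) if n % 2 == 0]
print(even_squares)

# Less readable: nested loops plus conditions crammed into one line
pairs = [(j, i) for i in range(10) if i % 2 == 0 for j in range(5) if j != i]

# The same logic as an explicit loop is easier to follow (and to debug)
pairs_loop = []
for i in range(10):
    if i % 2 == 0:
        for j in range(5):
            if j != i:
                pairs_loop.append((j, i))

print(pairs == pairs_loop)

# Output:
# [0, 4, 16, 36, 64]
# True
</code></pre>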
]]></content:encoded></item></channel></rss>