Is PySpark becoming the ETL standard?
Written by Jared Hillam
The transformations in ETL are the meat of an ETL tool. A corporation's intellectual property for how its data is reshaped is what makes an ETL tool sticky for the organizations that use it. Most companies don't have a great understanding of what those transformations are actually doing; they just know that the data behaves itself when it is queried (well… most of the time). If they really do want to know what the ETL tool is doing, most ETL tools provide a graphical representation of the transformations. Even then, the details that drive those logical steps are buried in deeper logical definition screens.
For 30 years, ETL tools have served the data pipelines of organizations. However, there is a force pulling transformation work toward where the data is stored, and it is driving companies to look for ways to make their ETL computations more adjacent to the database. This gravitation is bringing other factors into play as well. For example, the pain of negotiating with, planning around, and eventually exiting ETL tools has organizations looking for solutions that are less vendor-locked. But where do they go? The presumption that there is a silver-bullet answer is a little fanciful. However, for those that are quite conservative in their approach, the future is quite uncertain, so maximizing optionality is high on the agenda. This is producing a trend of PySpark becoming a target for those parties.
Open and Broad
PySpark is a target that does allow for non-cloud, non-vendor-specific, high-level data processing, which can be particularly attractive for organizations that operate in environments where data sovereignty and security are paramount. By utilizing PySpark, these organizations can keep the data processing close to the data storage layer, often within their own data centers or private clouds, thereby maintaining control and governance over their data workflows.
The allure of PySpark lies in its broad open-source adoption. Despite vendors creating their own syntaxes, it still offers an escape from most of the proprietary clutches of traditional ETL vendors.
With PySpark, data engineers and scientists can script complex transformations using Python, a language with which a vast number of professionals are familiar. This democratization of data processing means that organizations are not just less vendor-reliant, but they are also tapping into a larger pool of talent to maintain and enhance their data pipelines.
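To make that concrete, here is a minimal sketch of what such a scripted transformation looks like; the paths, table, and column names are hypothetical stand-ins, not any particular organization's pipeline.

```python
# A minimal PySpark transformation sketch. The path and the columns
# (order_amount, order_ts, region) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_example").getOrCreate()

# Read a source table, filter, derive a column, and aggregate --
# the same reshape steps a graphical ETL tool would chart as widgets.
orders = spark.read.parquet("/data/orders")  # hypothetical source path
daily_totals = (
    orders
    .filter(F.col("order_amount") > 0)                # drop bad rows
    .withColumn("order_date", F.to_date("order_ts"))  # normalize dates
    .groupBy("region", "order_date")
    .agg(F.sum("order_amount").alias("daily_total"))
)
daily_totals.write.mode("overwrite").parquet("/data/daily_totals")
```

Because the whole pipeline is ordinary Python, any engineer who reads Python can review it, version it, and test it like any other code.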
Let's take a look at the current landscape of cloud vendors and whether they support PySpark.
- On-premise Hadoop… Yes
- AWS… Yes
- Google Cloud… Yes
- Azure… Yes
- Databricks… Yes
- Snowflake… No (Well, no to the Spark part, but the Python part is Yes)
This broad support has made large, conservative organizations that are debating their future take a close look at PySpark as a target platform for ETL jobs. The allure is the ability to negotiate an exit without putting the organization at operational risk. The sheer expense of porting native compute from one cloud platform to another is itself a force pushing organizations to consider PySpark as a target, and that risk is something conservative organizations scrutinize closely. The last thing they want is to be noosed by a single vendor, unable to negotiate their costs. So much so that some organizations still haven't moved to the cloud, but rather tout their own "clouds." Nevertheless, the target language for them is PySpark.
Go To
All of these advantages see PySpark becoming the "go-to" code base for many ETL modernization efforts. However, this can be overdone. In their zeal to modernize to PySpark, organizations will often try to convert everything, right down to their SQL database logic. While it is possible to convert SQL logic to PySpark, it is often overkill. Reconstructing simple ANSI SQL into a far more complex programming structure just creates additional noise in the code. Where logic can be maintained in SQL, it makes far more sense to do so, and with open options like Spark SQL this kind of noise can be avoided, as the sketch below shows.
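Here is a sketch of that choice, assuming a hypothetical sales table: where the source logic is plain ANSI SQL, Spark SQL can run it nearly verbatim, while the DataFrame rewrite of the same aggregation adds code without adding value.

```python
# The same aggregation two ways; the view and column names are
# hypothetical. Spark SQL lets existing ANSI SQL run as-is.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql_vs_dataframe").getOrCreate()
sales = spark.read.parquet("/data/sales")  # hypothetical source path
sales.createOrReplaceTempView("sales")

# Option 1: keep the inherited SQL logic verbatim via Spark SQL.
by_region_sql = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")

# Option 2: the equivalent DataFrame rewrite -- more code, same result.
by_region_df = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_amount"))
)
```

Both options compile to the same Spark execution plan, so keeping the SQL costs nothing in performance while preserving readability for the people who wrote it.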
What you will lose with PySpark is the ETL widgets, as everything will be represented in code. Platforms such as Databricks allow for more organized segmentation of code in their notebooks. Because those widgets are gone, it is important that the code conversion be conducted in a highly structured way so that patterns persist across ETL transformations. This enables the organization to keep the code maintainable and automatable in the future; one such pattern is sketched below.
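One way to impose that structure (a sketch, not a prescription) is to give every converted widget the same DataFrame-in, DataFrame-out function signature, so each transformation follows a uniform, automatable pattern; all names here are hypothetical.

```python
# A pattern sketch: each former ETL "widget" becomes a small, named
# function with a uniform DataFrame-in / DataFrame-out signature.
from typing import List
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def deduplicate(df: DataFrame, keys: List[str]) -> DataFrame:
    """Replaces a 'distinct' widget: keep one row per key set."""
    return df.dropDuplicates(keys)

def standardize_dates(df: DataFrame, col: str) -> DataFrame:
    """Replaces a 'date format' widget: normalize to a DATE column."""
    return df.withColumn(col, F.to_date(F.col(col)))

# A pipeline then reads as a chain of the same repeating pattern:
# cleaned = standardize_dates(deduplicate(raw, ["customer_id"]), "signup_dt")
```

Because every transformation shares one shape, new jobs can be reviewed, generated, and tested against the same template, which is what keeps a widget-free codebase maintainable at scale.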
Once in a Career
While ETL tools have been sticky for decades, their grip is finally loosening. Organizations that wouldn't have dreamed of upsetting the applecart of their data pipelines are now making a once-in-a-career pipeline change. With this change, they're thinking much harder about future-proofing their next move, and PySpark is one of the targets coming out on top.
Who is Intricity?
Intricity is a specialized selection of over 100 Data Management Professionals, with offices located across the USA and headquarters in New York City. Our team of experts has implemented in a variety of industries, including Healthcare, Insurance, Manufacturing, Financial Services, Media, Pharmaceutical, Retail, and others. Intricity is uniquely positioned as a partner to the business that deeply understands what makes the data tick. This joint knowledge and acumen have positioned Intricity to beat out its Big 4 competitors time and time again. Intricity's area of expertise spans the entirety of the information lifecycle. This means when your problem involves data, Intricity will be a trusted partner. Intricity's services cover a broad range of data-to-information engineering needs:
What Makes Intricity Different?
While Intricity conducts highly intricate and complex data management projects, Intricity is first and foremost a Business User Centric consulting company. Our internal slogan is to Simplify Complexity. This means that we take complex data management challenges and not only make them understandable to the business but also make them easier to operate. Intricity does this by using tools and techniques that are familiar to business people but adapted for IT content.
Thought Leadership
Intricity authors a highly sought-after Data Management Video Series targeted toward Business Stakeholders at https://www.intricity.com/videos. These videos are used in universities across the world. Here is a small set of universities leveraging Intricity's videos as a teaching tool:
Talk With a Specialist
If you would like to talk with an Intricity Specialist about your particular scenario, don’t hesitate to reach out to us. You can write us an email: specialist@intricity.com
(C) 2023 by Intricity, LLC
This content is the sole property of Intricity LLC. No reproduction can be made without Intricity's explicit consent.
Intricity, LLC. 244 Fifth Avenue Suite 2026 New York, NY 10001
Phone: 212.461.1100 • Fax: 212.461.1110 • Website: www.intricity.com