We recently had a call with a large systems integrator partner who had been attempting a code conversion using LLMs (Large Language Models) to modernize a data system. “It was a cluster f#%&” were the choice words used to open the laundry list of issues they had experienced during the project.

In this short whitepaper, I want to break down the benefits of LLMs in code conversion, the right projects to use them for, and the drawbacks.

LLMs Converting Code

Large Language Models are phenomenal at converting code from one syntax to another. You can test it yourself: go into ChatGPT and ask it to convert a T-SQL script into a Databricks or Snowflake script. It will produce one, and it can give you four versions to pick from. In many ways an LLM is similar to a human, or even to many humans. For this very reason, universities (and a host of other professions) have been panicking about how to evaluate human effort. A professor can give a class a single assignment such as “Write an essay on the impact of Joan of Arc on contemporary culture,” and the same LLM will produce 40 different versions; with a little coaching, it can do so in very specific styles of writing. Students can simply hand over the hard work to ChatGPT, and suddenly their effort mostly goes into grooming the output to look credible.
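
To make that variability concrete, below is a minimal sketch of two equally valid PySpark conversions an LLM might produce for the same T-SQL aggregate. The table and column names are hypothetical, and an orders table is assumed to already be registered in the Spark catalog. Both variants run and return the same data, yet they share almost no surface pattern:

    # Original T-SQL (hypothetical):
    #   SELECT CustomerID, SUM(Amount) AS Total
    #   FROM dbo.Orders
    #   GROUP BY CustomerID
    #   HAVING SUM(Amount) > 1000
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Variant 1: SQL passthrough -- keeps the legacy shape almost verbatim.
    totals_v1 = spark.sql("""
        SELECT CustomerID, SUM(Amount) AS Total
        FROM orders
        GROUP BY CustomerID
        HAVING SUM(Amount) > 1000
    """)

    # Variant 2: DataFrame API -- same result, entirely different surface pattern.
    totals_v2 = (
        spark.table("orders")
        .groupBy("CustomerID")
        .agg(F.sum("Amount").alias("Total"))
        .filter(F.col("Total") > 1000)
    )

Multiply this across every construct in a script, and two runs of the same prompt can yield structurally unrelated code.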

This points to one of the key advantages of using an LLM… creativity. They are endlessly creative and can produce thousands of variations. Anyone who has trained LLMs knows that this creativity can become, at times, a frustrating feature, especially when a very specific result is desired. Even the AI companies grooming these models struggle to generate consistent responses to very straightforward questions. In some more sensitive cases, such as political ones, the AI companies will outright instruct the model not to respond at all.

Personal Experience

I personally attempted to build a chatbot (using ChatGPT) to assist a partner community in answering basic questions about BladeBridge. After about a week of training, the bot still couldn’t seem to remember strict answers to critically important questions, particularly when the test questions were asked in an unexpected way. The back-and-forth teaching of the right pattern would seem to work for a moment and then, like an Alzheimer's patient, be lost forever. This is why AI requires so much training data: certain answers and their permutations need enough weighting to surface reliably. But at the edges of those neural networks, responses are very hard to pin down to a single answer. So when you want a highly strict answer, LLMs struggle.

This edge-case variability usually isn’t a problem in code conversion if we’re only converting one script. In fact, with one script the variability is probably preferable, as developers might want to experiment with variations that could produce more efficient code. But when converting 10,000 scripts, that variability can be a serious problem.


Code Variability

The one area where code variability at scale might seem to be a non-issue is where the legacy code itself has no patterns to begin with. For example, imagine an organization that never “stood on the shoulders” of its prior logic and built everything from scratch. In such a case, you would think the customer has no patterns to speak of anyway, so there is no real concern about the LLM creating random coding patterns. From a compiling perspective, this would certainly be true, because the code could be compiled fairly easily. However, the challenge would be in the data testing portion of the code conversion effort. See, at the end of the conversion, the modern database has to produce precisely the same data as the legacy database. With such a wide array of patterns, the data integrity testing portion of the project takes on a life of its own, as each script must now be individually verified to ensure it produces the right data along with the right syntax.

This does not mean the project is doomed to failure; it simply means the data testing iteration will be a much larger portion of the effort, as each script is an island unto itself. Additionally, to successfully execute a conversion under these circumstances, it would be advisable to keep both systems (legacy and modern) alive for an extended period, so the data testing outcomes can be compared in parallel and every trigger that might cause a discrepancy can be surfaced. This all stems from the patterning from script to script being essentially nonexistent in our fictitious case study.
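
As a minimal sketch of what that parallel testing can look like (the table names and Spark setup are placeholders, not a prescribed framework), the snippet below compares the same table as produced by each system:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder tables: a snapshot mirrored from the legacy system and
    # the output of the converted pipeline.
    legacy = spark.table("legacy_mirror.daily_revenue")
    modern = spark.table("modern.daily_revenue")

    # Cheap first gate: row counts must match.
    assert legacy.count() == modern.count(), "row count mismatch"

    # Stronger gate: surface any row present on one side but not the other.
    diff = legacy.exceptAll(modern).union(modern.exceptAll(legacy))
    if diff.count() > 0:
        diff.show(20, truncate=False)  # sample discrepancies for triage
        raise ValueError("data discrepancy between legacy and modern outputs")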

Real World Code

Code without patterns is uncommon in large enterprises. Patterns naturally emerge because developers prefer not to start from scratch when building new features, so some degree of patterning typically exists. A good example of high patterning is ETL tooling. Such tools commonly have highly structured patterns and processes that define a data pipeline. The more patterned the code, the stronger the justification for automation, since the patterning can be handled globally.

Target Code Impacts

Sustaining patterns in code conversion starts to really matter when the target is a highly variant coding language. These are largely open-ended languages like Python, PySpark, stored procedures, Java, etc. They have very low-grain constructs that can be assembled into larger constructs, and there are thousands of ways any given construct can be represented.

Imagine now that you have all your code converted to PySpark and you have been supporting the new environment for some time. Imagine that you need to hook another tool into the PySpark code in every location just before a data merge occurs. How would you do it? Well, if your code had a highly consistent coding sequence for data merges, it would be relatively easy to hook into the code base by identifying the consistent keywords in the code. However, if your code converter had many varying ways of making the same data merge happen, you would have to hunt and peck your way through the code to identify where those merges might occur. I chose a fairly benign example, but code management events happen every day in data pipelines, so the last thing an organization needs is a code base full of splintered patterns, especially when highly consistent patterns were readily available from the legacy platform or an old ETL tool.
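
To illustrate, here is a hypothetical, Delta Lake flavored sketch of what a single consistent merge pattern buys you; the function and hook names are invented for the example. If every generated job routes its merge through one wrapper, new tooling attaches at a single, greppable seam:

    from delta.tables import DeltaTable

    def pre_merge_hook(target_table, updates_df):
        # Placeholder seam: audit logging, data-quality checks, lineage capture...
        pass

    def standard_merge(spark, target_table, updates_df, key_col):
        """The one merge pattern every generated job shares."""
        pre_merge_hook(target_table, updates_df)  # <- new tooling attaches here, once
        (
            DeltaTable.forName(spark, target_table)
            .alias("t")
            .merge(updates_df.alias("u"), f"t.{key_col} = u.{key_col}")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

With thousands of ad-hoc merge variants instead, there is no such seam, and the hook has to be hand-placed job by job.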


Configuration-Based Conversion

Configuration-based code/metadata conversion tools are not nearly as sexy-sounding as an LLM code conversion method. However, using a configuration-based approach does not preclude the use of LLMs; rather, it assigns them to a more appropriate stage in the code conversion process.

A configuration-based conversion like the ones found in BladeBridge relegates the code conversion definitions to a set of JSON configuration files that are editable and auditable. If a pattern is not desired, it can be modified in the JSON and the Converter re-run, so the change is applied globally. The output patterns from BladeBridge need not be large code segments; they can be low-grain patterns. Keeping those low-grain patterns consistent, and having them roll up into similarly consistent high-grain patterns, enables a great deal of target code control.
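
As a purely hypothetical sketch of the idea (this is not BladeBridge's actual JSON schema), a configuration rule maps a legacy construct to a target template, and the converter applies it uniformly:

    import json
    import re

    # One editable rule: map a legacy construct to a target template.
    RULES = json.loads("""
    {
      "rules": [
        { "name": "tsql_getdate",
          "match": "GETDATE\\\\(\\\\)",
          "template": "current_timestamp()" }
      ]
    }
    """)

    def convert(script: str) -> str:
        for rule in RULES["rules"]:
            script = re.sub(rule["match"], rule["template"], script, flags=re.IGNORECASE)
        return script

    print(convert("SELECT GETDATE() AS load_ts"))
    # -> SELECT current_timestamp() AS load_ts

Edit the rule, re-run the Converter, and every script picks up the change consistently. So where does the LLM come into the picture?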

Configuration-Based Conversion + LLM

A configuration-based approach will commonly convert 70-90% of each job. This is a good point for an LLM to come into the picture. The compile errors found in a job can be run through the LLM along with the published target code, both to identify where the errors are and to get recommendations on how to improve the job. Some of BladeBridge’s leading system integration partners leverage this approach to speed up the manual effort of taking jobs that are 70-90% converted and closing the remaining 10-30%. Special attention needs to be paid to whether further global changes should be rolled back into the BladeBridge configuration files; this ensures that, at a global level, the patterns are sustained as much as possible.
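
A minimal sketch of that repair loop, assuming some LLM endpoint is available, might look like the following; call_llm is a hypothetical stand-in, not a real API, and the prompt wording is illustrative:

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: wire this to your LLM provider of choice.
        raise NotImplementedError

    def repair_job(published_code: str, compile_errors: list) -> str:
        """Feed the converted job plus its compile errors to an LLM and ask
        for a targeted fix, preserving the already-converted majority."""
        prompt = (
            "The following job was machine-converted and fails to compile.\n"
            "Fix only what the errors require; do not restructure working code.\n\n"
            "Errors:\n" + "\n".join(compile_errors) + "\n\n"
            "Code:\n" + published_code
        )
        return call_llm(prompt)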

Code the LLM Recognizes

The other advantage of configuration-based conversion + LLM is that the LLM is handed code it can actually understand. ETL tools don’t expose their metadata in commonly known languages like PySpark or SQL; rather, it sits in obscure XML formats for which LLMs could never hope to get enough training data. Since so many customers are seeking to modernize into languages such as PySpark and SQL, LLMs are well suited to those targets. With BladeBridge producing ready code in syntaxes the LLMs understand, acceleration is far easier to obtain.


Slope of Enlightenment

Gartner created its Hype Cycle graph many years ago, and AI is hitting its peak. There is so much hype that it is difficult to see reality. There’s nothing to scoff at when it comes to AI: the power is there, and the excitement is justified. However, after many rounds of working with system integrators on different code conversion efforts, BladeBridge is starting to see the emergence of the Slope of Enlightenment after those system integrators have hit the Trough of Disillusionment. Mixing the consistency of BladeBridge with the creativity of LLMs is a reliable accelerator in driving conversion productivity.

