Modernizing ETL: Migrating from SSIS to PySpark with Agentic AI
As data platforms evolve, many organizations are rethinking their dependence on legacy tools like SQL Server Integration Services (SSIS) and SQL Server-based warehouses. SSIS has been a dependable workhorse for years, but the shift toward cloud-native, scalable, and automated solutions, especially on platforms like Databricks, has opened the door to modernization.
This blog lays out a structured approach for migrating extract, transform, load (ETL) workflows from SSIS to PySpark-based notebooks. The approach leverages AI agents orchestrated through frameworks like LangGraph to build an intelligent, modular migration utility. The focus is on the why, how, and key lessons—so teams planning a similar move can take away practical guidance to accelerate their migration journey.
Why Organizations Are Moving Away from SSIS
While SSIS remains a cornerstone of Microsoft’s data ecosystem, several challenges crop up in modern environments:
- Scaling Limitations: Built mainly for on-premises workloads, SSIS struggles with distributed, cloud-first data processing.
- Cloud-readiness Gaps: Connecting to modern data lakes or cloud platforms often needs custom workarounds.
- Automation Hurdles: Its GUI-driven nature makes adopting DevOps and continuous integration/continuous delivery (CI/CD) practices cumbersome.
- Big Data Constraints: SSIS was never designed for large-scale or unstructured data.
These limitations are pushing organizations toward platforms like PySpark on Databricks, which are cloud-native, distributed, and automation-friendly, and better aligned with governance and lineage requirements.
Understanding the Legacy Before Migrating
One of the hardest parts of any migration isn’t writing new code; it’s understanding what organizations already have.
Over the years, organizations accumulate hundreds or even thousands of SSIS packages, each with unique logic and dependencies. Much of the SSIS logic is embedded in XML, making it hard to interpret without specialized tools. Important transformations or business rules can be buried deep within package configurations. Documentation is often missing or outdated. Without clear documentation, understanding the purpose and flow of each package becomes a manual, error-prone process.
SSIS packages frequently call each other or depend on external scripts, databases, or files. Mapping these dependencies is challenging, and missing one can lead to migration failures. Many packages also use custom scripts or third-party components, which may not have direct equivalents in PySpark. For these, the logic needs to be rewritten—there’s no shortcut.
Analyzing and extracting logic manually from SSIS packages is labor-intensive and increases the risk of missing critical business logic. Because SSIS packages capture specific business context, technical teams may not fully understand the rationale behind certain workflows, leading to misinterpretation during migration.
Because of these challenges, automating the discovery phase becomes essential. When designing an automation utility, the first step should always be to consolidate the inventory and run a script-based analysis. This analyzer should process SSIS packages in bulk. Since SSIS embeds its logic in XML, the analyzer should convert that XML into queryable relational metadata. It should also surface the control flows, task sequences, dependencies, and transformation patterns present in the packages.
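To make this concrete, here is a minimal sketch of such an analyzer in Python, using the standard library’s xml.etree to walk .dtsx files in bulk. The DTS namespace and attribute names shown match common SSIS package versions, but packages vary, so treat the element and attribute names as assumptions to verify against your own estate.

```python
# Minimal sketch of a bulk SSIS package analyzer (assumes .dtsx files on disk;
# DTS element/attribute names should be verified against your SSIS version).
import glob
import xml.etree.ElementTree as ET

DTS = "{www.microsoft.com/SqlServer/Dts}"  # the DTS XML namespace

def analyze_package(path: str) -> dict:
    root = ET.parse(path).getroot()
    # Control-flow tasks: each DTS:Executable carries a name and a type
    tasks = [
        {"name": e.get(f"{DTS}ObjectName"), "type": e.get(f"{DTS}ExecutableType")}
        for e in root.iter(f"{DTS}Executable")
    ]
    # Precedence constraints define task sequencing (the edges of the control flow)
    edges = [
        (c.get(f"{DTS}From"), c.get(f"{DTS}To"))
        for c in root.iter(f"{DTS}PrecedenceConstraint")
    ]
    return {"package": path, "tasks": tasks, "edges": edges}

# Process packages in bulk; the resulting rows can then be loaded into a
# relational table for querying.
inventory = [analyze_package(p) for p in glob.glob("packages/**/*.dtsx", recursive=True)]
```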
Bridging the Gap: Moving Toward PySpark
Once the legacy landscape is understood, the next step is translating the intent into PySpark. This is where complexity creeps in—handling task sequencing, data type mismatches, orchestration flows—but a structured approach makes it manageable.
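One mechanical piece of that structure is data type mapping, which a simple lookup table handles well. Below is a partial, illustrative mapping from SSIS DT_* types to Spark SQL types; in practice, precision and scale for numeric types would come from the package metadata rather than the hard-coded values shown here.

```python
# Partial SSIS-to-Spark type mapping (a sketch; extend for your own estate).
from pyspark.sql.types import (
    BooleanType, DecimalType, DoubleType, IntegerType,
    LongType, StringType, TimestampType,
)

SSIS_TO_SPARK = {
    "DT_I4": IntegerType(),             # 32-bit signed integer
    "DT_I8": LongType(),                # 64-bit signed integer
    "DT_R8": DoubleType(),              # double-precision float
    "DT_STR": StringType(),             # ANSI string (encoding handled at read time)
    "DT_WSTR": StringType(),            # Unicode string
    "DT_BOOL": BooleanType(),
    "DT_DBTIMESTAMP": TimestampType(),
    "DT_NUMERIC": DecimalType(38, 18),  # precision/scale come from package metadata
}
```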
Why Agentic AI Matters for Complex Code Modernization
Modernizing ETL from SSIS to PySpark is not a simple “lift and shift” migration. It involves interpreting intricate orchestration logic, embedded SQL, transformation patterns, and other complexities—often across hundreds of packages—and then rewriting them as corresponding PySpark code. Traditional Python script-based tools excel at as-is migrations, but they fall short when logic must be interpreted and restructured rather than copied.
This is where agentic AI serves as a catalyst for transformation. Through agentic AI, we break the problem into specialized agents, each responsible for a specific task. These agents work in tandem, orchestrated through agentic workflows, to tackle complexity in a modular, scalable way, ensuring context isn’t lost.
The next challenge is figuring out how to break the SSIS-to-PySpark migration into manageable steps that large language model (LLM) agents can handle. Usually, the approach starts by translating SSIS packages into documentation that outlines functionality, flow, and dependencies. This serves as a reference for both technical and business stakeholders and becomes input for other agents to generate code.
Once the document is created, AI agents can leverage details of data flow tasks (DFTs) and control flow tasks (CFTs) to generate PySpark transformations and extract orchestration logic, including task sequencing, failure handling, and looping constructs.
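As an illustration of the DFT side, a common source → lookup → derived column → destination pattern translates naturally to the DataFrame API. The table and column names below are hypothetical; the point is the shape of the code an agent would generate.

```python
# Hypothetical PySpark equivalent of a common DFT pattern:
# OLE DB Source -> Lookup -> Derived Column -> OLE DB Destination.
# `spark` is the ambient SparkSession in a Databricks notebook.
from pyspark.sql import functions as F

orders = spark.read.table("staging.orders")       # source
customers = spark.read.table("dim.customers")     # lookup reference

enriched = (
    orders
    .join(customers.select("customer_id", "region"), "customer_id", "left")  # Lookup
    .withColumn("order_year", F.year("order_date"))                          # Derived Column
)

enriched.write.mode("append").saveAsTable("dw.fact_orders")  # destination
```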
The generated document also contains embedded SQL logic with stored procedures, views, and table DDLs. Another agent can convert these into PySpark-native functions, ensuring business rules are preserved.
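For example, a T-SQL stored procedure that upserts a dimension table could be regenerated as a PySpark-native function built on Delta Lake’s MERGE. The table and key names here are illustrative assumptions.

```python
# Hypothetical rewrite of a T-SQL upsert stored procedure as a PySpark
# function using Delta Lake's MERGE API (available on Databricks).
from delta.tables import DeltaTable

def upsert_customer_dim(spark, updates_df):
    """Upsert incoming customer rows into the dimension table."""
    target = DeltaTable.forName(spark, "dw.dim_customer")
    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()      # existing customers: update in place
        .whenNotMatchedInsertAll()   # new customers: insert
        .execute()
    )
```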
Finally, once all these tasks are converted, another agent merges these separate files into a cohesive PySpark pipeline that mirrors the original SSIS package’s intent and flow.
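Put together, this pipeline can be wired up as a LangGraph state graph: the documentation agent feeds the DFT/CFT and SQL agents in parallel, and a merge agent assembles their outputs. Below is a minimal sketch assuming a recent LangGraph release; the node bodies are stubs where real LLM calls would go, and the file name is hypothetical.

```python
# Minimal sketch of the agent pipeline as a LangGraph state graph.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class MigrationState(TypedDict):
    package_xml: str
    documentation: str
    transformation_code: str
    sql_code: str
    pipeline_code: str

def document_agent(state: MigrationState) -> dict:
    # Summarize functionality, flow, and dependencies from the package XML
    return {"documentation": "..."}

def dft_cft_agent(state: MigrationState) -> dict:
    # Generate PySpark transformations and orchestration logic from the doc
    return {"transformation_code": "..."}

def sql_agent(state: MigrationState) -> dict:
    # Convert embedded SQL (procs, views, DDLs) into PySpark-native functions
    return {"sql_code": "..."}

def merge_agent(state: MigrationState) -> dict:
    # Assemble the generated pieces into one cohesive PySpark pipeline
    return {"pipeline_code": "..."}

graph = StateGraph(MigrationState)
graph.add_node("document", document_agent)
graph.add_node("transform", dft_cft_agent)
graph.add_node("sql", sql_agent)
graph.add_node("merge", merge_agent)
graph.add_edge(START, "document")
graph.add_edge("document", "transform")        # fan out: code-gen agents
graph.add_edge("document", "sql")              # run in parallel
graph.add_edge(["transform", "sql"], "merge")  # fan in: wait for both
graph.add_edge("merge", END)

app = graph.compile()
result = app.invoke({"package_xml": open("packages/load_orders.dtsx").read()})
```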
Through these agentic workflows, we are able to drive:
- Parallel execution: Multiple agents can process different aspects of migration simultaneously, reducing time-to-delivery.
- Context-aware orchestration: Agents share intermediate outputs, ensuring consistency across documentation, logic conversion, and pipeline generation.
- Extensibility: New agents can be added for emerging patterns or custom business rules without redesigning the entire system.
- Traceability: Outputs are stored in structured formats (e.g., Delta tables), providing auditability and lineage for every migration step, as sketched below.
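As a small illustration of that last point, each agent’s output can be appended to a Delta audit table; the table name and schema below are assumptions.

```python
# Sketch of recording each agent's output in a Delta table for auditability.
# `spark` is the ambient SparkSession in a Databricks notebook.
from pyspark.sql import functions as F

audit_rows = [
    ("load_orders.dtsx", "document", "docs/load_orders.md"),
    ("load_orders.dtsx", "transform", "out/load_orders_transform.py"),
]

(
    spark.createDataFrame(audit_rows, "package string, step string, artifact_path string")
    .withColumn("recorded_at", F.current_timestamp())
    .write.mode("append")
    .saveAsTable("migration.audit_log")
)
```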
Key Takeaways
Automation delivers the best results when the execution platform natively supports scale, lineage, and governance, keeping engineering time focused on business logic rather than plumbing. Some key takeaways from this approach include:
- Automate, but validate: Automated extraction dramatically speeds up discovery, but manual checks prevent surprises and preserve intent.
- Document everything: Clear documentation of workflows, mappings, and logic reduces maintenance costs and accelerates onboarding.
- Engage stakeholders early: Business users often hold crucial context about “why” certain workflows exist; involving them early prevents rework.
- Expect iteration: The first version won’t be perfect. Plan for refinement cycles to address edge cases and optimize performance.
Conclusion: More Than a Technical Shift
Migrating from SSIS to PySpark is not only a code transformation; it is a shift toward intelligent automation and agentic orchestration. Traditional modernization efforts often focus on code translation, but the real opportunity lies in reimagining ETL through the lens of agentic AI. By combining Mosaic AI for reasoning with LangGraph for workflow coordination, organizations can turn a complex, multi-package estate into a structured, repeatable process, modernizing legacy systems for scale, resilience, and future-ready analytics. This approach enables teams to accelerate migrations, preserve business logic, and ensure every transformation is both explainable and traceable.