How We Used AI/ML To Ease Debt Risk Management For An Acclaimed Cancer Center

DevOps AI/ML Ops

A Prominent Cancer Center in the US



CI/CD | Cloud | Containers


GoCD | Docker | Rancher


Amazon Web Services | Rancher | Cattle Orchestrator


Enabling Early Debt Risk Identification

Our team used AI/ML techniques along with DevOps principles to help a prominent US-based cancer center quickly identify patients with high debt risk. This enabled the center to intervene early and offer payment plans that could help them minimize bad debt.


About the customer

The customer is one of the largest US-based centers dedicated to cancer research and treatment, and consistently ranks among the Best Hospitals for Cancer Care according to the US News and World Report. The center also offers education on cancer prevention and treatment to students, trainees, professionals, and the public.


Business Challenge

With a high volume of patients visiting the center each year, the customer would accumulate millions of dollars in unpaid patient accounts. Their objective was to better manage this “bad debt”. They needed to be able to identify patients with high debt risk earlier, so that they could intervene and offer them an appropriate financial plan.

The best way to do this was by using AI/ML techniques to predict debt risk well in advance.




The solution we offered implements a “training module” and a “prediction module” that run every week with new sets of data. Training is the phase where new data is ingested into an ML model. The trained model is later used by the prediction module to identify and forecast debt risk cases among new patients. The process also involves getting data from third-party sources such as credit institutions.

01. Gathering Data from Vendors

To feed the training model, we need data from different sources. These sources are trusted vendors who share data via subscriptions to a pull-based system. This is stored as Big Data.

02. Training the Model

The Big Data from the previous step is unclassified. To train the model, we need to classify it and feed it to the training model.

03. Generating Predictions

The same ML model is applied to generate the prediction based on the trained data. This is also handed over to the application.

04. Failure Scenarios

Each step can take hours or even days, and is prone to failure due to network issues, broken data, and downtime of different components.


We broke the entire problem down into three distinct phases.

Automating: The first phase was to set up automation for getting data from third-party vendors. Each vendor provides data through a different mechanism. Some vendors run cron jobs that push the data onto a shared machine in the cloud. A few others offer API endpoints that can be scraped via pipelines. The data is stored in different databases, including Hadoop containers. All of this is done within GoCD as pipelines, with stages that cover all the endpoints needed. The containers are run sequentially based on the success of each step. We used Rancher with Cattle to facilitate this flow.

Modeling: In the second phase, we get the training model that is applied to these different data sets. The model is then equipped with enough data to run its prediction. This is done as a different stack within Rancher.

Predicting: Finally, the third phase is implemented to generate predictions. This depends on the new registrations from the confidential patient data. As this is highly confidential data, we have employed vaults and encryption of the attributes, which is facilitated by Rancher by default.
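The success-gated, sequential flow described in the Automating phase can be illustrated with a minimal sketch. This is not the GoCD or Rancher configuration used in the project; the `Stage` type, the `run_stages` function, and the stage names are hypothetical and only show the idea that a later stage runs when the previous one has succeeded.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str                # hypothetical stage label, e.g. "collect"
    run: Callable[[], bool]  # returns True when the stage succeeds


def run_stages(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for stage in stages:
        try:
            ok = stage.run()
        except Exception:
            ok = False       # an exception counts as a failed stage
        if not ok:
            break            # later stages depend on this one, so stop here
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    # Placeholder stages standing in for the three phases.
    pipeline = [
        Stage("collect", lambda: True),
        Stage("train", lambda: True),
        Stage("predict", lambda: True),
    ]
    print(run_stages(pipeline))
```

In a real orchestrator each `run` callable would correspond to launching a container and checking its exit status.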


GoCD Pipeline


The GoCD pipeline gets triggered every week and starts its sequential execution to get a new set of data. Each job is run asynchronously as a container so that the failure of any job doesn’t impede the next. Failures are reported to the respective Slack channels, and jobs can be re-triggered once the problem is identified.
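As an illustration of the failure reporting described above, here is a minimal sketch that builds and posts a message to a Slack incoming webhook. The source does not say how the Slack integration was implemented, so the use of an incoming webhook, the function names, and the message wording are all assumptions.

```python
import json
import urllib.request


def build_failure_message(job: str, error: str, run_id: str) -> dict:
    """Build a Slack incoming-webhook payload describing a failed job."""
    return {"text": f"Job '{job}' failed in run {run_id}: {error}"}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; return the HTTP status.

    `webhook_url` is a placeholder that must be supplied by the caller.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping message construction separate from the network call makes the wording easy to test without contacting Slack.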



At the time of implementing this project, Rancher was a robust orchestrator with an intuitive way of scheduling containers and defining a flow based on simple “docker-compose” files. Rancher’s scheduler, host-agnostic scaling capabilities, and visible logging of each container service gave us an easy way to run the “train” and “prediction” models.



The workloads, including Rancher, the databases, and GoCD, were hosted on AWS and continuously monitored using CloudWatch so that we could scale them accordingly. A comprehensive notification system alerted us to all changes.

Along with this setup, we applied software development and deployment principles to the ML model so that we could work with it more easily: we versioned the results and kept the code reproducible so that the setup could be replicated. We also set up monitoring and alerting features for seamless delivery of the pipeline. With continuous improvements being made to AI/ML Ops technology, we’re also re-architecting the solution to reduce delivery time.



By consuming the prediction data and having versioned datasets, the application was able to deliver results faster. And with smaller, asynchronous jobs in the GoCD pipelines, failures could also be attended to without delays.


Faster Resolutions Via Pipelines


With asynchronous data gathering via pipelines, we gained the opportunity to track multiple vendors along with any data inconsistencies and address them individually.


Version Controlled Models

Previously, it was a challenge to have the models and algorithms correctly versioned. Applying DevOps principles brought everything under control and facilitated higher value creation.


Independent Execution Via Containers


Containerizing every component, from pipeline data scraping to launching the train and predict programs, created isolated steps that made the system as a whole easier to debug.


Project Highlights


Resilient Data Collection

Instead of having a single big pipeline for all the data sources, we used multiple data pipelines, which allowed us to identify failures with certain vendors and then re-trigger the data collection. This saved a lot of time in identifying the specific source of failure and enabled faster debugging.
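The idea of collecting from each vendor independently and re-triggering only the failed sources can be sketched as follows. The vendor names and the `collect_all` and `failed_vendors` helpers are hypothetical; in the project itself this role was played by separate GoCD pipelines rather than a single Python process.

```python
from typing import Callable, Dict, Iterable, List, Optional


def collect_all(
    fetchers: Dict[str, Callable[[], object]],
    only: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Run each vendor's fetcher independently and record its outcome.

    `only` restricts the run to the named vendors, which is how a
    re-trigger of failed sources is expressed in this sketch.
    """
    names = list(only) if only is not None else list(fetchers)
    status = {}
    for name in names:
        try:
            fetchers[name]()
            status[name] = "ok"
        except Exception as exc:
            # One vendor failing must not stop the others.
            status[name] = f"failed: {exc}"
    return status


def failed_vendors(status: Dict[str, str]) -> List[str]:
    """Names of vendors whose last run did not succeed."""
    return [name for name, state in status.items() if state != "ok"]
```

A re-trigger then becomes `collect_all(fetchers, only=failed_vendors(status))`, which leaves the vendors that already succeeded untouched.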

Versioned ML Models

Previously, the ML models had been maintained as Jupyter Notebooks saved in folders, which, over time, resulted in multiple unorganized folders.

By applying software engineering principles, we were able to find ways to version control the models, and educate data scientists and engineers about the need for doing so. Version control allowed us to keep track of how an ML model evolved right from day one, while also simplifying collaboration over particular ML models.
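The source does not say which tooling was used to version the models. As one illustrative technique (an assumption, not the team's documented method), a trained model artifact can be stored under an identifier derived from its content, together with a small metadata file. The `register_model` and `load_model` names, the directory layout, and the metadata keys are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def register_model(artifact: bytes, registry: Path, metadata: dict) -> str:
    """Store a model artifact under a content-derived version ID.

    The version is the first 12 hex characters of the artifact's SHA-256
    digest, so identical artifacts always map to the same version.
    """
    version = hashlib.sha256(artifact).hexdigest()[:12]
    target = registry / version
    target.mkdir(parents=True, exist_ok=True)
    (target / "model.bin").write_bytes(artifact)
    (target / "metadata.json").write_text(
        json.dumps(metadata, sort_keys=True, indent=2)
    )
    return version


def load_model(registry: Path, version: str) -> bytes:
    """Read back the artifact stored for a given version."""
    return (registry / version / "model.bin").read_bytes()
```

With a scheme like this, each weekly training run can record which version the prediction module should load, and older versions remain available for comparison.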

Eliminated A Single Point of Failure

Our approach runs data collection, model training, and prediction on new data as separate stages within Docker containers. Besides having alerts set up for each stage, we watched failure points closely. This gave us more control over the components and allowed us to debug smaller events before they could cause a total system failure.