Predicting Quote Conversion in Insurance
An AI-powered tool that helps insurance underwriters focus on the quotes most likely to convert.
Overview
This project began after a series of brainstorming sessions with an insurance client, who wanted to explore new ways to capitalise on the data captured by their insurance marketplace platform.
One idea stood out: a tool that would let users upload their current pipeline of quotes and instantly see a prediction of how likely each quote is to bind.
At the time, only around 10% of submitted quotes were converting, meaning large amounts of time were being spent chasing low-probability deals.
We set out to build an AI model, API, and simple web interface that could quickly process quote data and present ranked predictions.
The result was a solution that prioritised each quote by its likelihood to convert, enabling brokers and underwriters to focus on the opportunities that matter most.
Data Extraction & Assembly
Before any analysis could take place, we first needed to collect and prepare the data.
Quote information was spread across several SQL tables, with some important fields, such as limits and premiums, stored as embedded JSON arrays within columns. To create a unified dataset, we used SQL to combine and “explode” these nested structures, transforming them into a consistent, tabular format suitable for analysis.
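For illustration, the same “explode” step can be sketched in pandas (the production work was done in SQL, and the column names below are hypothetical, not the client's actual schema):

```python
import json
import pandas as pd

# Hypothetical extract: each quote row carries an embedded JSON array of
# coverage records, mirroring the nested columns in the source SQL tables.
quotes = pd.DataFrame({
    "quote_id": [101, 102],
    "product": ["property", "liability"],
    "coverages": [
        '[{"limit": 1000000, "premium": 5200}, {"limit": 500000, "premium": 2100}]',
        '[{"limit": 2000000, "premium": 9800}]',
    ],
})

# Parse the JSON strings, explode to one coverage per row, then flatten the
# dicts into ordinary columns so the result is a plain tabular dataset.
quotes["coverages"] = quotes["coverages"].apply(json.loads)
exploded = quotes.explode("coverages", ignore_index=True)
flat = pd.concat(
    [exploded.drop(columns="coverages"),
     pd.json_normalize(exploded["coverages"].tolist())],
    axis=1,
)
print(flat)
```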
Once assembled, the combined dataset captured several years of quote history across multiple products and countries, providing a solid foundation for exploratory analysis and model development.
Exploratory Data Analysis
We explored the dataset in a Jupyter notebook using Python, combining visual inspection with statistical methods to understand relationships between features and the target variable.
Our first step was to evaluate basic distributions, missing values, and outliers. Summary statistics and boxplots quickly highlighted inconsistencies in numeric fields, including several quotes with unusually large premium values that appeared to be data entry errors.
The boxplots also revealed heavy right skew in premiums and revenue, which is to be expected for financial data; many values fall outside the whiskers of the plots below.
*Boxplots: Premium, Revenue, and Limit*
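These checks are quick to run; a minimal sketch of the inspection, assuming the assembled dataset sits in a DataFrame `df` with these illustrative column names:

```python
import matplotlib.pyplot as plt

# Visual check of the heavy right skew in the financial fields.
fields = ["premium", "revenue", "limit"]
fig, axes = plt.subplots(1, len(fields), figsize=(12, 4))
for ax, col in zip(axes, fields):
    ax.boxplot(df[col].dropna())
    ax.set_title(f"{col} (skew={df[col].skew():.1f})")
plt.tight_layout()
plt.show()
```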
To understand the predictive power of individual features, we calculated univariate separability scores using the Area Under the ROC Curve (AUC). This revealed that one field had an unusually high AUC against the outcome, indicating potential data leakage. On further review, we confirmed that the field was only populated after a quote was bound, so it had to be removed.
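A screen along these lines takes only a few lines with scikit-learn; the column names here are illustrative:

```python
from sklearn.metrics import roc_auc_score

# Score each numeric feature on its own against the bound/not-bound target.
# An AUC near 1.0 (or 0.0) flags a field that may leak the outcome.
target = df["is_bound"]  # assumed target column
for col in df.select_dtypes("number").columns.drop("is_bound"):
    mask = df[col].notna()
    auc = roc_auc_score(target[mask], df.loc[mask, col])
    print(f"{col:<30} AUC={auc:.3f}")
```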
We discovered that one product type behaved differently from all others. All quotes for this product were consistently binding, and including them would have dominated the model, causing it to focus on this single, perfectly predictable pattern instead of learning the more subtle relationships across other product types.
Finally, we analysed the industry variable, which was encoded using NAICS industry classification codes. These codes proved too granular, producing hundreds of rarely occurring categories.
Through this analysis, we developed a deeper understanding of the data and identified the transformations required to make it reliable and predictive. The EDA process also validated our feature assumptions and helped guide the next phase of model development.
Data Preparation & Feature Engineering
Following the exploratory phase, we prepared the data for modelling by applying the insights gathered during analysis. The goal was to create a clean, consistent, and informative dataset that captured meaningful business patterns without introducing data leakage.
We began by removing problematic and irrelevant rows identified during EDA, including all quotes for the product type that bound consistently, along with several entries whose implausibly high premium values pointed to data entry errors. Columns that risked leaking the outcome, such as post-bind status indicators, were excluded from the training set.
Categorical variables, including product type and country, were standardised and encoded using one-hot encoding so they were compatible with machine learning algorithms. The industry NAICS codes were too granular for the available data, so we grouped them into broader industry categories before one-hot encoding them.
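A sketch of the encoding step with scikit-learn, assuming hypothetical column names and pre-made training/validation frames; `handle_unknown="ignore"` keeps inference safe when an unseen category appears:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on training data only; categories first seen at inference time then
# encode as all zeros rather than raising an error.
cats = ["product_type", "country", "industry_sector"]  # illustrative names
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_cat_train = encoder.fit_transform(train_df[cats])
X_cat_val = encoder.transform(val_df[cats])
```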
Feature Engineering
To improve model performance and interpretability, a shared feature engineering pipeline was built to convert raw quote data into structured, model-ready features. The same transformations are applied consistently at both training and inference time, ensuring alignment between model development and production scoring.
Temporal Features
Quote creation timestamps were used to derive several time-based indicators capturing seasonality and operational behaviour.
| Feature | Description |
|---|---|
| created_dow | Day of week the quote was created (0–6). |
| created_hour | Hour of day of creation, capturing working-hour effects. |
| created_month | Month of year to identify seasonal trends. |
| is_weekend / is_business_hours / is_morning / is_end_of_month | Binary indicators for operational timing patterns. |
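A sketch of how these indicators can be derived with pandas, assuming the raw timestamp lives in a `created_at` column and using illustrative thresholds for the binary flags:

```python
import pandas as pd

# `df` is the assembled quote dataset; `created_at` is the assumed name
# of the raw quote-creation timestamp column.
ts = pd.to_datetime(df["created_at"])
df["created_dow"] = ts.dt.dayofweek                     # 0 = Monday ... 6 = Sunday
df["created_hour"] = ts.dt.hour
df["created_month"] = ts.dt.month
df["is_weekend"] = (df["created_dow"] >= 5).astype(int)
df["is_business_hours"] = ts.dt.hour.between(9, 17).astype(int)  # illustrative window
df["is_morning"] = (ts.dt.hour < 12).astype(int)
df["is_end_of_month"] = (ts.dt.day >= 25).astype(int)            # illustrative threshold
```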
Financial Normalisation and Scaling
Financial magnitudes were highly skewed, so log transformations were introduced to stabilise the data and improve comparability.
| Feature | Transformation | Purpose |
|---|---|---|
| revenue_log1p | log(1 + revenue) | Compress extreme values and improve scale. |
| premium_log1p | log(1 + premium) | Compress extreme values and improve scale. |
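In code, each transformation is a single numpy call (column names assumed):

```python
import numpy as np

# np.log1p computes log(1 + x), which is stable at zero and compresses
# the long right tail seen in the raw boxplots.
df["premium_log1p"] = np.log1p(df["premium"])
df["revenue_log1p"] = np.log1p(df["revenue"])
```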
After applying the log transformations to premium and revenue, the boxplots show a far healthier distribution for modelling.
*Boxplots: Premium (log1p) and Revenue (log1p)*
Ratio and Relational Features
Ratios between premiums and limits were added to provide context on pricing and risk appetite.
| Feature | Description |
|---|---|
| premium_to_limit | Ratio of premium to insured limit. |
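A guarded sketch of the ratio, assuming these column names, which avoids infinities when a limit is zero or missing:

```python
import numpy as np

# Premium relative to insured limit gives pricing context; fall back to
# NaN rather than dividing by zero.
df["premium_to_limit"] = np.where(
    df["limit"] > 0, df["premium"] / df["limit"], np.nan
)
```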
Industry Grouping
Industry codes were mapped to higher-level sectors to reduce sparsity and improve generalisation across unseen data.
| Feature | Description |
|---|---|
| industry_sector_name | Mapped from NAICS-style codes to broad sectors such as Manufacturing or Finance & Insurance. |
| industry_subsector_grouped | Rare subsectors grouped under “Other” based on minimum sample thresholds. |
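The rare-subsector grouping can be sketched as follows, with an illustrative threshold and hypothetical column names; the counts should be taken from the training split only:

```python
# Collapse rare subsectors into "Other" so one-hot encoding does not
# produce hundreds of near-empty columns.
MIN_SAMPLES = 50  # illustrative minimum sample threshold
counts = train_df["industry_subsector"].value_counts()
keep = counts[counts >= MIN_SAMPLES].index
train_df["industry_subsector_grouped"] = train_df["industry_subsector"].where(
    train_df["industry_subsector"].isin(keep), other="Other"
)
```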
Before training, the data was split into training and validation sets using stratified sampling to maintain the class balance of bound and non-bound quotes. This ensured the evaluation results reflected real-world behaviour and not random distribution effects.
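With scikit-learn this is a single call; the 80/20 split and prepared `X`/`y` below are illustrative:

```python
from sklearn.model_selection import train_test_split

# Stratify on the target so the ~10% bind rate is preserved in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```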
The resulting dataset was clean, structured, and model-ready, providing a reliable foundation for the development and evaluation of predictive models.
Model Development & Evaluation
With the dataset prepared and validated, the next step was to identify which modelling approach would best capture the patterns that influence whether a quote binds. We evaluated a range of supervised classification algorithms, balancing model complexity, interpretability, and performance.
Model Selection
Because the dataset combined numeric, categorical, and engineered ratio features, we focused primarily on tree-based ensemble methods, known for their ability to model non-linear relationships and handle mixed feature types with minimal preprocessing. For benchmarking, we also included baseline linear and kernel-based models to illustrate the performance gap between traditional techniques and gradient boosting.
| Model | Type |
|---|---|
| LightGBM | Gradient Boosting (Histogram-based) |
| XGBoost | Gradient Boosting (Tree-based) |
| Gradient Boosting (sklearn) | Baseline Boosting |
| Random Forest | Bagging Ensemble |
| Logistic Regression | Linear Model |
| Support Vector Machine | Kernel Method |
Model Comparison Routine
To ensure fairness across experiments, all models were trained using the same stratified train/validation split and evaluated on identical data. A custom model comparison script automated this process, standardising metrics, reproducibility settings, and evaluation outputs.
Each model was trained with a fixed random seed to guarantee repeatable results and tuned using a consistent set of hyperparameters optimised for generalisation rather than overfitting. The evaluation script computed three key metrics: ROC-AUC, Precision-Recall AUC (PR-AUC), and Lift@50 (the bind rate among the top 50 ranked quotes relative to the overall rate), providing a balanced view of ranking quality on the imbalanced target.
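For reference, the core of such a routine can be sketched as below; the function names are ours, but Lift@50 follows the definition above:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def lift_at_k(y_true, y_score, k=50):
    """Bind rate among the top-k ranked quotes divided by the overall bind rate."""
    y_true = np.asarray(y_true)
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].mean() / y_true.mean()

def evaluate(name, model, X_val, y_val):
    """Score a fitted classifier on the shared validation split."""
    scores = model.predict_proba(X_val)[:, 1]
    print(f"{name:<20} ROC-AUC={roc_auc_score(y_val, scores):.3f}  "
          f"PR-AUC={average_precision_score(y_val, scores):.3f}  "
          f"Lift@50={lift_at_k(y_val, scores):.2f}")
```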
The table below summarises the final model comparison results:
Final Rankings (by PR-AUC)
| Model | ROC-AUC | PR-AUC | Lift@50 | Binds in Top 50 |
|---|---|---|---|---|
| LightGBM | 0.929 | 0.702 | 8.59 | 15 |
| XGBoost | 0.952 | 0.676 | 8.02 | 14 |
| Gradient Boosting | 0.919 | 0.611 | 8.02 | 14 |
| Random Forest | 0.918 | 0.439 | 7.45 | 13 |
| Logistic Regression | 0.538 | 0.048 | 2.29 | 4 |
| SVM | 0.482 | 0.044 | 0.57 | 1 |
The LightGBM model achieved the best overall balance of recall and precision, with a PR-AUC of 0.70 and over eight-fold lift at the top 50 predictions compared to random chance. Although XGBoost recorded a slightly higher ROC-AUC, LightGBM demonstrated superior precision-recall performance, making it the preferred choice for deployment.
The results also illustrate the strength of gradient-boosting techniques on structured business data and the limitations of traditional linear approaches in this context. Logistic Regression and SVM models struggled to capture the complex interactions between features, while ensemble tree methods handled them naturally.
LightGBM was selected as the production model for its combination of high accuracy and speed.
The Solution
We packaged the model into a simple, production-ready API that fits seamlessly into the client’s workflow:
- API: A lightweight FastAPI endpoint that accepts a CSV of new quotes and returns them ranked by bind probability (see the sketch below).
- Demo Interface: A simple drag-and-drop tool that demonstrates the API: users upload a CSV file and view the ranked results.
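A minimal sketch of such an endpoint, assuming a hypothetical engineer_features helper standing in for the shared pipeline, plus an illustrative model path and route name:

```python
import io

import joblib
import pandas as pd
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = joblib.load("model.joblib")  # illustrative path to the trained LightGBM model

def engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for the shared feature engineering pipeline described above;
    at inference it must apply exactly the training-time transformations."""
    raise NotImplementedError

@app.post("/rank-quotes")  # illustrative route name
async def rank_quotes(file: UploadFile):
    """Accept a CSV of quotes and return them ranked by bind probability."""
    quotes = pd.read_csv(io.BytesIO(await file.read()))
    features = engineer_features(quotes)
    quotes["bind_probability"] = model.predict_proba(features)[:, 1]
    ranked = quotes.sort_values("bind_probability", ascending=False)
    return ranked.to_dict(orient="records")
```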

What’s Next
We’ll continue to collect data and periodically revisit the model comparison to make sure we’re extracting as much predictive power from the data as possible.
We also plan to introduce interpretability dashboards to help users understand which factors most influence each prediction, building transparency and trust in AI-assisted decision making.
Key Takeaways
- Identified high-probability quotes using AI-driven ranking
- Fast, self-contained API and upload tool
- ROC-AUC of 0.93 and PR-AUC of 0.70, indicating strong predictive performance
- Improved broker efficiency and decision focus


