Testing the untestable: How Virgin Money and PwC are building trust in GenAI

People in meeting in an office

Virgin Money is targeting the significant benefits GenAI will bring, from increased productivity to empowering its people – but not before thorough testing to address potential risks.

Virgin Money logo

Industry

Banking

Our role

GenAI testing and assurance

Featuring

GenAI in regulated sectors

“With nascent technology, one of the first things you want to do is test it, to make sure it’s safe to adopt,” says Nidhi Agarwal, Virgin Money’s chief model risk officer.

For a regulated, risk-conscious industry such as banking, adopting an AI tool to boost staff productivity can feel daunting. “Even as recently as a year or two ago, people were saying you can’t test GenAI or Large Language Models,” says Chris Heys, who leads PwC’s AI and Modelling practice for the banking sector.

Video

Virgin Money interview

4:16
More tools
  • Closed captions
  • Transcript
  • Full screen
  • Share
  • Closed captions

Playback of this video is not currently available

Transcript

This is how Virgin Money and PwC began their GenAI journey together; by testing the untestable. Underpinning Virgin Money’s approach to the emerging technology is its ethos of ‘smart disruption’. Rather than waiting for others to pilot generative artificial intelligence (GenAI) in banking, the company aimed to embed a GenAI-assistant productivity tool directly into employees’ workflows - so testing to understand the risks was crucial.

Examples of use cases that could save hours of Virgin Money’s people’s time every week include creating new documents, converting code, and writing up meeting minutes. But there is still a “huge range” of use cases Agarwal is excited to explore.

“This tool will evolve – as it starts learning more about Virgin Money’s policies, documents, standards and data, it will become richer, more intelligent, giving us more accurate outcomes which we can more easily use in our decision making. Though the human-in-the-loop will always be important.”

Nidhi Agarwal,Chief Model Risk Officer, Virgin Money

The real benefit is handing over repetitive tasks to a tool that can do them in a fraction of a second. “This will free people up to do what they’re really good at – applying human judgement in evaluating the information they’re given,” Agarwal continues.

Understanding and managing risks

However, developing and using Large Language Models (LLMs) calls for careful oversight to balance the benefits with the potential risks – nowhere more so than in the regulated banking sector. “Financial services, as an industry, is very conservative. It takes a cautious approach to adopting new technology,” says Agarwal.

The challenge was to devise a way of testing to understand how a GenAI assistant works, where it boosted productivity, where it faltered, and how to mitigate risks - especially in the light of new regulation, such as the EU AI Act. “We wanted to be able to say, you can test it, you can understand it - and if you can do that, you can build trust in it,” says Heys.

“While we’ve spent years testing complex financial models and machine learning models - GenAI and LLMs have taken model testing to a whole new level of challenge,” Heys continues. “Testing GenAI is a brand-new problem; it deals with words not numbers, it's humanlike, it gives a different answer every time, the datasets are massive.”

Together, PwC and Virgin Money developed a testing framework and innovated testing technology, which helped the bank validate and implement the GenAI tool for its staff. The joint team – of data scientists, technologists, AI experts, prompt model risk and model validation experts - ran live tests to build a full picture of errors, hallucinations, bias and toxicity, using AI to test AI - or ‘LLM As-a-Judge' technology.

This complemented the more traditional testing approaches using statistical techniques and specialist human testing experts. While using AI to test AI might sound counter-intuitive, it is based on the empirical observation that LLMs perform better at evaluating responses than they do at creating responses. Using LLM As-a-Judge helps to scale and automate testing, allowing full testing to be completed in five weeks.

Frontierland spirit

“It’s a very innovative technology,” says Agarwal, “but we shouldn’t forget the fact there is model risk involved in the use of this.” Indeed, after testing, the team found some of the risks could be easily controlled by staff, and that high accuracy levels depended on it being operated by well-trained people. The testing identified hallucinations in a few aspects of the GenAI tool that were outside of Virgin Money’s risk tolerance and so additional controls and guardrails were developed to address these risks - and satisfy regulatory due diligence.

The results of the team’s GenAI testing are now with the organisation’s AI Council, which brings together all the areas AI touches - testing, risk, commercial, compliance, and more - to weigh the value of each new use case.

With that ‘safe use’ framework in place, the business has laid the foundations for an AI-enabled future. “There’s no need to start from scratch, piloting each use case,” says Agarwal. She adds: “Virgin Money can now be much bigger, bolder and more ambitious in rolling out GenAI to all employees.”

With the right guidance and a better understanding of the GenAI assistant, staff trust in the tool is growing as the results improve. “It really felt like frontier work – showing that it can be done,” Heys says. “It was good being able to put GenAI in the hands of a whole organisation – democratise it, with everyone using it and starting to feel comfortable with it. Then you can see the productivity benefits coming through, right across the organisation.”

Chris Heys
It felt like really frontier work.

Even as recently as a year or two ago, people were saying, you can't really test GenAI – it's a black box - and myself and Nidhi, PwC and Virgin Money, we wanted to say, you can test it.

Nidhi Agarwal
Financial services as an industry is very conservative. It takes a cautious approach to adopting new technology. GenAI is quite nascent at the moment, whilst Virgin Money, one of the values is smart disruption, and we wanted to embrace it.

Chris Heys
One of the first GenAI apps that Virgin Money are going to release is a GenAI assistant productivity tool. Something that can help the HC department, the lending department, the risk department.

Nidhi Agarwal
The reason we got excited about it is because of the efficiency benefits that you can get with using generative AI, where instead of you having as a human to go through multiple documents to create a summary of that, or to find a particular content from those, if only you can ask a machine to do it for you in a fraction of a second.

Chris Heys
For years people like myself and Nidhi and the model risk profession, we've been testing the traditional financial models. But testing genAI, it's a brand-new problem: it's human-like, it gives a different answer every time, it deals with words, not numbers, the data sets are massive.

Nidhi Agarwal
With any nascent technology, one of the first things you want to make sure is it's safe to adopt it. We partnered with PwC to provide us the expertise to go through that process of safe and thorough validation.

Chris Heys
We broadly use three sorts of areas: the first area is human testing, the second area is using statistical tests - so these have been around for a long time. But the third area and I would say probably the most exciting is AI to test AI. So that’s LLM as a judge. And that enables us to really scale up the testing, to do thousands of tests for bias, for toxicity, for hallucination, to make sure it has the right quality, to make sure it's doing what we understand, and that it's got the right style and tone for the organisation.

Nidhi Agarwal
When we saw the responses from the LLM, we found that if we are asking it to respond to questions without referencing specific documents that were given to it, the error rate was much higher, around 60%. But then if you provide specific references to documents then the error rate reduces massively.

Chris Heys
We found it was quite secure, and we found that it could give robust and reliable answers lots of the time. On the flip side, occasionally it makes an error, occasionally it's not accurate. And you can paint a picture of when those errors happen and then you can put the appropriate guardrails and controls in place so you can move forward confidently and get all those benefits.

Nidhi Agarwal
Another use case that we have been testing the LLM on is summarisation of documents. If you're giving three or four documents and wanted to summarise the content into a short paragraph, is the quality of that paragraph good enough? We are quite satisfied with that.

Chris Heys
How it played out was a really collaborative team. We brought along our team from PwC, Virgin Money brought along a lot of their team. We had model risk experts, we had model validation experts, we had data scientists, we had technologists. So, we had a lot of ground to cover, but we were able to do that in four to five weeks.

Nidhi Agarwal
At the moment Virgin Money has completed the validations required. We understand the risks out of using these LLMs, and we are at a stage where we are creating user training for the safe roll-out of the tools.

Whilst human-in-the-loop is exceptionally important in any LLM use case, one thing that I do foresee will happen as the tool starts getting more trained on internal Virgin Money data, policies, it will become richer, more intelligent, and start giving us more accurate outcomes later on, which we can more easily use in our decision-making.

Chris Heys
I've really, really enjoyed this. I quite like the challenge of someone saying to me you can't test something and then working out and showing how to test it.

And I feel quite passionately that GenAI could be massive for my clients and for society as a whole.

Nidhi Agarwal
I am excited about the prospect at the bank and in the financial industry about use of GenAI because I do think a lot of work that we do is repetitive, can be automated, and leave us space and time to actually do things that we are really good at in terms of creative work or value-add work.

Our contributors:

Nidhi Agarwal

Chief Model Risk Officer,
Virgin Money

Chris Heys

Partner, PwC UK

Featured client stories

Mitsubishi UFJ Trust and Banking Corporation

Industry: Banking
Theme: GenAI

Learn more

Chapel Down

Industry: Wine
Theme: Transformation

Learn more

Artificial intelligence

Banking and capital markets

Contact us

Chris Heys

Chris Heys

Partner, PwC United Kingdom

Tel: +44 (0)7715 034667

Tim Chapman

Tim Chapman

Director, PwC United Kingdom

Tel: +44 (0)7701 296628

Follow us