IG Group — KYC automation across 40+ languages: OCR + GPT-4 + multilingual NMT
Architected and deployed an end-to-end automated workflow for verifying and processing international KYC documents, combining OCR, machine translation and GPT-4. Reduced onboarding time by 35% and reached 89% accuracy on a ground-truth set of 1,200+ documents in 40+ languages, in close collaboration with Compliance and Legal.
Context
IG Group is a global LSE-listed retail trading broker serving clients in the UK, US, Poland, Japan, Bermuda, Dubai and dozens of other jurisdictions. Every new client must pass KYC (Know Your Customer) — a process tightly regulated by the FCA (UK), CFTC (US), JFSA (Japan), DFSA (Dubai) and local equivalents. Documentation arrives in dozens of languages and many formats (passports, ID cards, utility bills, bank statements).
Problem
Pre-AI workflow:
- Each document had to be manually translated by a compliance officer or external translator.
- Mean onboarding time was about 2.7 days, long enough for prospects to drop off.
- Compliance was the bottleneck; in peak periods the queue grew into the thousands.
- Inconsistencies between regions: different people, different interpretations, different errors.
Solution
I architected an automated KYC pipeline combining three layers:
- OCR layer — text extraction from images/PDFs (passports, IDs, utility bills) handling rotation, low contrast, non-Latin scripts.
- Multilingual NMT layer: machine translation specialized for official documents, including harder source languages in non-Latin scripts (e.g., Thai, Arabic, Japanese).
- GPT-4 verification layer — structured field extraction (name, DOB, address, document number), document type classification, cross-document field consistency checks.
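The cross-document consistency check in the verification layer can be sketched roughly as follows. This is a minimal illustration, not the production code: the ExtractedFields type, the field names and the normalization rules are assumptions made for the example, and it assumes the date format of each document is already known from upstream processing.

```python
import unicodedata
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


def normalize_name(name: str) -> str:
    """Strip combining diacritics, fold case and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def parse_date(raw: str, fmt: str) -> date:
    """Parse a date with an explicit format; guessing would misread 03/04 vs 04/03."""
    return datetime.strptime(raw, fmt).date()


@dataclass(frozen=True)
class ExtractedFields:
    # Hypothetical record produced by the extraction step for one document.
    source: str                     # e.g. "passport", "utility_bill"
    full_name: str
    date_of_birth: Optional[date]   # None when the document carries no DOB


def find_mismatches(docs: list[ExtractedFields]) -> list[str]:
    """Return the names of fields that disagree across a client's documents."""
    mismatches = []
    if len({normalize_name(d.full_name) for d in docs}) > 1:
        mismatches.append("full_name")
    dobs = {d.date_of_birth for d in docs if d.date_of_birth is not None}
    if len(dobs) > 1:
        mismatches.append("date_of_birth")
    return mismatches


passport = ExtractedFields("passport", "José  Müller", parse_date("12.08.1974", "%d.%m.%Y"))
bill = ExtractedFields("utility_bill", "JOSE MULLER", None)
print(find_mismatches([passport, bill]))  # prints []
```

Comparing normalized values rather than raw strings keeps accents, casing and spacing differences between documents from producing false mismatches. Letters without a Unicode decomposition (for example "Ł") are not folded by this approach and would need an explicit mapping.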
Compliance and Legal collaboration: every pipeline stage was reviewed against GDPR, FCA SYSC and EU AI Act readiness. I implemented per-call audit logging of GPT-4 requests, EU/UK data residency, and an opt-out from OpenAI model training.
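One way per-call audit logging could look is sketched below. The record fields, the function names and the choice to store SHA-256 hashes instead of raw prompt text (to keep personal data out of the log) are assumptions for illustration; the actual log schema is not described here.

```python
import hashlib
import json
from datetime import datetime, timezone


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(request_id: str, model: str, prompt: str,
                       response_text: str, region: str) -> dict:
    """Build one audit entry per model call without storing raw document text."""
    return {
        "request_id": request_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "region": region,                       # hypothetical residency tag, e.g. "uk"
        "prompt_sha256": _sha256(prompt),       # hash only, assumed PII policy
        "response_sha256": _sha256(response_text),
        "prompt_chars": len(prompt),
    }


def append_audit_record(path: str, record: dict) -> None:
    """Append the record as one JSON line to an append-only log file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Hashing makes it possible to prove later which prompt and response a log entry refers to, provided the originals are retained in a separately controlled store.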
Outcomes
- Account opening time reduced by 35%, from an average of 2.7 days to about 1.75 days (2.7 × 0.65 ≈ 1.755).
- 120+ hours of manual processing saved per month across all regions.
- 89% accuracy measured against a ground-truth dataset of 1,200+ documents in 40+ languages.
- Zero compliance incidents in the first year of production.
- Stakeholder satisfaction: the compliance team was freed for tasks requiring human judgment, and sales reported a shorter lead-to-active path.
What I learned
- Multilingual processing is more than translation: proper nouns, date formats and diacritics need their own handling.
- Passport OCR is a specialized problem — generic Tesseract isn’t enough; specialized MRZ (Machine Readable Zone) models lifted accuracy ~12 points.
- GPT-4 in compliance demands determinism — temperature=0, JSON schema, strict output validation, fallback to human review on low confidence.
- Compliance/Legal as project co-owners, not blockers: included from day 1, they contribute better ideas than when invited at the end.
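The MRZ and determinism lessons above can be combined into a small validation gate, sketched here under stated assumptions. The check-digit rule (weights 7, 3, 1, sum modulo 10) is the one defined in ICAO Doc 9303; the JSON field names, the 0.9 threshold, the presence of a "confidence" value in the model output and the Decision type are all hypothetical. The model call itself (temperature=0, schema-constrained output) is left out.

```python
import json
from dataclasses import dataclass

# Hypothetical schema and threshold, chosen only for illustration.
STRING_FIELDS = ("full_name", "date_of_birth", "document_number")
CONFIDENCE_THRESHOLD = 0.9


def mrz_check_digit(field: str) -> int:
    """ICAO Doc 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    total = 0
    for index, ch in enumerate(field):
        if "0" <= ch <= "9":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * (7, 3, 1)[index % 3]
    return total % 10


@dataclass(frozen=True)
class Decision:
    route: str   # "auto" or "human_review"
    reason: str


def review(reason: str) -> Decision:
    return Decision("human_review", reason)


def validate_model_output(raw_json: str, mrz_number: str, mrz_check: int) -> Decision:
    """Accept the model output only if every check passes; otherwise escalate."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return review("output is not valid JSON")
    expected_keys = set(STRING_FIELDS) | {"confidence"}
    if not isinstance(data, dict) or set(data) != expected_keys:
        return review("missing or unexpected fields")
    if not all(isinstance(data[k], str) and data[k] for k in STRING_FIELDS):
        return review("empty or non-string field")
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return review("confidence is not a number")
    try:
        digit_ok = mrz_check_digit(mrz_number) == mrz_check
    except ValueError:
        digit_ok = False
    if not digit_ok:
        return review("MRZ check digit failed")
    if data["document_number"] != mrz_number.rstrip("<"):
        return review("document number does not match MRZ")
    if confidence < CONFIDENCE_THRESHOLD:
        return review("confidence below threshold")
    return Decision("auto", "all checks passed")


# Document number and check digit taken from the ICAO 9303 specimen passport.
raw = ('{"full_name": "ANNA MARIA ERIKSSON", "date_of_birth": "1974-08-12", '
       '"document_number": "L898902C3", "confidence": 0.97}')
print(validate_model_output(raw, "L898902C3", 6).route)  # prints auto
```

Every failure path returns a human-review decision rather than raising, so a malformed or low-confidence response can never be auto-approved by accident.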