PaperKit · India-first KYC + document AI
An India-first KYC and document-AI extraction platform with an Ed25519 audit chain on every extraction: Aadhaar (UIDAI-masked), PAN, Voter ID, driving licence across 36 RTO layouts, passport MRZ, GST invoice with HSN line items, and bank-statement aggregation across 11 Indian banks. The incumbents (Karza-Perfios, Signzy, Hyperverge, IDfy, AuthBridge) ship a database-row audit trail; the foreign cloud OCR APIs (Textract, Document AI, Form Recognizer, Onfido) can't host inside India. PaperKit ships tamper-evident provenance, DPDP §13 deletion certificates, and feature parity between self-hosted and SaaS, with 2,601 software tests passing (pipeline and integration tests, not a measure of OCR or extraction accuracy).
01 — Who it's for
Banks, NBFCs, fintechs, brokers and insurtechs all process the same stack (Aadhaar, PAN, a bank statement, a GST invoice) and all carry the same 2026 gap: no cryptographic provenance on the verification step. The RBI's March 2024 revised Master Direction on KYC (RBI/2024-25/118) cited recurring KYC audit-trail deficiencies across its inspection cohort. PaperKit is the audit-chain-native extraction layer that closes that gap.
— ICP · 01 · Universal banks
800–1,100 onboardings/day across a branch cluster, 18–32 minutes per file, a 3–8% first-pass field-entry error rate, and inconsistent Aadhaar masking driving compliance findings. Needs sub-8-minute onboarding and an RBI-inspection-ready artefact.
— ICP · 02 · Fintechs & brokers
4,000–18,000 onboardings/day on a stack built fast for growth. Per-doc OCR fees from the incumbents compress margin; the audit trail is a Postgres row. RBI Payment Aggregator and SEBI audit expectations are tightening.
— ICP · 03 · NBFCs
Commercial-vehicle and equipment financing books growing 18–22%/yr, gated on KYC + GST invoice cross-verification. A single fraudulent GSTIN match drives a costly recovery case. Needs GSTIN validation + GSTN cross-check + HSN at the loan step.
— ICP · 04 · PSU banks · DPDP-strict
Published RFI scoping for self-hosted KYC AI; foreign-cloud residency conflicts with DPDP-aligned RBI cyber-security circulars. Needs air-gapped operation with the same audit-chain, schema set and DPDP-deletion flow as the SaaS tier.
02 — The pipeline
A document lands at the API gateway, gets tenant-scoped and rate-limited, and runs the same six-stage pipeline whether you call it over REST, gRPC, or in-process. Each external dependency — OCR engine, VLM, verification API — sits behind a Protocol + StubAdapter + ProductionAdapter, so tests run with zero credentials and production slots fill at deploy time.
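The adapter pattern described above can be sketched with `typing.Protocol`. This is a minimal illustration, not PaperKit's actual code: `OcrEngine`, `OcrResult`, `StubOcrAdapter`, `ProductionOcrAdapter` and `run_ocr` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrResult:
    text: str
    confidence: float  # aggregate confidence in [0, 1]


class OcrEngine(Protocol):
    """Structural interface every OCR adapter is expected to satisfy."""

    def recognise(self, page_bytes: bytes) -> OcrResult: ...


class StubOcrAdapter:
    """Deterministic stand-in for tests: no network, no credentials."""

    def __init__(self, canned_text: str = "STUB TEXT", confidence: float = 0.99) -> None:
        self._result = OcrResult(canned_text, confidence)

    def recognise(self, page_bytes: bytes) -> OcrResult:
        return self._result


class ProductionOcrAdapter:
    """Placeholder for a real engine; the credential is injected at deploy time."""

    def __init__(self, api_key: str) -> None:
        self._api_key = api_key

    def recognise(self, page_bytes: bytes) -> OcrResult:
        raise NotImplementedError("the vendor SDK call would go here")


def run_ocr(engine: OcrEngine, page_bytes: bytes) -> OcrResult:
    """Pipeline code depends only on the Protocol, never on a concrete adapter."""
    return engine.recognise(page_bytes)


# Tests pass the stub; a deployment would pass ProductionOcrAdapter(api_key=...).
result = run_ocr(StubOcrAdapter("GOVERNMENT OF INDIA"), b"%PDF-")
```

Because the pipeline is typed against the Protocol, swapping the stub for a production adapter needs no change to pipeline code.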
For mixed scans, photos, PDFs
PDF page-split, image normalisation, orientation correction. PDF, JPG, PNG, TIFF, HEIC accepted; multi-page documents handled with per-page resolution.
For government-tender and RBI-inspector explainability requirements
A signature-based DocumentClassifier resolves each page to a document type (aadhaar, pan, gst_invoice, …) with an explainable "why" — the audit answer to "why did you classify this as PAN?".
For noisy Indic scans
Mistral OCR 3 by default (Tesseract for air-gapped). Below the per-tenant confidence floor (default 0.85), escalate to a VLM — Claude, Gemini, OpenAI, or local PaliGemma / Qwen — per your routing rules.
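A minimal sketch of the confidence-floor escalation, assuming each engine is a callable that returns text plus an aggregate confidence. `Read`, `Reader` and `read_page` are hypothetical names, and returning a `needs_review` flag when the VLM read is also below the floor is an assumption about how the review hand-off might be signalled.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Read:
    text: str
    confidence: float  # aggregate confidence in [0, 1]
    engine: str


Reader = Callable[[bytes], Read]


def read_page(page: bytes, ocr: Reader, vlm: Reader, floor: float = 0.85) -> Tuple[Read, bool]:
    """Run OCR first; escalate to the VLM only when OCR falls below the floor.

    Returns (read, needs_review). needs_review is True when the final read
    is still below the floor, so it should not be delivered as final.
    """
    read = ocr(page)
    if read.confidence < floor:
        read = vlm(page)
    return read, read.confidence < floor


# Illustrative engines: a blurry OCR read and a stronger VLM read.
def blurry_ocr(page: bytes) -> Read:
    return Read("GOVT 0F 1NDIA", 0.61, "ocr")


def strong_vlm(page: bytes) -> Read:
    return Read("GOVT OF INDIA", 0.93, "vlm")


read, needs_review = read_page(b"scan", blurry_ocr, strong_vlm)
```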
For UIDAI Reg 2021 §7
Per-document extractors emit schema-validated JSON with per-field confidence. Aadhaar is masked to XXXX-XXXX-NNNN with the Verhoeff checksum validated and the embedded photograph stripped — by default, every time.
For independent verification
Every extraction writes an AuditChainEntry — (prev_hash, payload_hash, ts, signer_key_fingerprint, signature) — Ed25519-signed by the per-tenant root key. The payload-hash sees only the masked form, never the raw Aadhaar.
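The hash-linking half of this design can be sketched with the standard library alone. The sketch below is an assumption-laden illustration: `ChainEntry`, `append_entry`, `first_broken_link` and `GENESIS` are hypothetical names, SHA-256 is assumed as the hash, and the Ed25519 signature and signer fingerprint are omitted because signing needs a third-party library.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import List, Optional

GENESIS = "0" * 64  # assumed placeholder prev_hash for the first entry


@dataclass(frozen=True)
class ChainEntry:
    prev_hash: str
    payload_hash: str
    ts: str

    def entry_hash(self) -> str:
        body = json.dumps([self.prev_hash, self.payload_hash, self.ts]).encode()
        return hashlib.sha256(body).hexdigest()


def append_entry(chain: List[ChainEntry], masked_payload: dict, ts: str) -> List[ChainEntry]:
    """Hash the masked payload only, then link to the previous entry's hash."""
    prev = chain[-1].entry_hash() if chain else GENESIS
    canonical = json.dumps(masked_payload, sort_keys=True).encode()
    return chain + [ChainEntry(prev, hashlib.sha256(canonical).hexdigest(), ts)]


def first_broken_link(chain: List[ChainEntry]) -> Optional[int]:
    """Return the index of the first entry whose prev_hash does not match, else None."""
    expected = GENESIS
    for index, entry in enumerate(chain):
        if entry.prev_hash != expected:
            return index
        expected = entry.entry_hash()
    return None


chain: List[ChainEntry] = []
for n in range(3):
    chain = append_entry(chain, {"masked_aadhaar": "XXXX-XXXX-9012", "seq": n},
                         f"2026-01-0{n + 1}T00:00:00Z")

# Editing entry 1 changes its hash, so entry 2 no longer links to it.
tampered = list(chain)
tampered[1] = replace(tampered[1], payload_hash="f" * 64)
```

Link checks alone cannot catch an edit to the newest entry, which is one reason a design like this also relies on per-entry signatures and an external anchor.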
For low-confidence reads
Below the floor, the extraction is not delivered as final — a task lands in the human-in-loop Review queue. Every confirm / correct / reject is logged as an OperatorCorrection in the chain.
03 — Document-type catalogue
India isn't a generic OCR region. Aadhaar has a Verhoeff checksum and a masking mandate; PAN has the
Rule 114 format of the Income-tax Rules; the driving licence has 36 RTO layouts; the GST invoice has GSTIN, HSN/SAC and
a three-tax split. Every extractor below is a real module with a schema in paperkit/schemas/
and a per-field confidence score.
vs manual, inconsistent masking
12-digit detection + Verhoeff checksum; first 8 digits masked to XXXX-XXXX-, last 4 retained; embedded photograph stripped per Reg 2021 §7. A structured KUA/AUA opt-out retains the full number in encrypted form, with the policy override audit-logged.
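For readers unfamiliar with the Verhoeff checksum, here is a minimal sketch of the standard algorithm. The function names are hypothetical, and the short number in the example is a textbook test value, not an Aadhaar number.

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

# Position permutations: row k is the base permutation applied k times.
_BASE = [1, 5, 7, 6, 2, 8, 3, 0, 9, 4]
P = [list(range(10))]
for _ in range(7):
    P.append([_BASE[x] for x in P[-1]])


def verhoeff_valid(number: str) -> bool:
    """True when the digit string, including its final check digit, verifies."""
    if not number.isascii() or not number.isdigit():
        return False
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(partial: str) -> str:
    """Compute the check digit to append to a digit string."""
    c = 0
    for i, ch in enumerate(reversed(partial)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])


check = verhoeff_check_digit("236")  # textbook example
```

Unlike a simple modulus check, Verhoeff catches all single-digit errors and all adjacent transpositions, which are the typical key-in mistakes.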
vs free-text key-in error
10-character extraction enforced against [A-Z]{5}[0-9]{4}[A-Z], optional NSDL PAN-verify cross-check for status (valid / inactive / cancelled), and a name cross-field consistency check against the Aadhaar in the same batch.
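The format rule quoted above translates directly into a full-match regular expression. In this sketch `is_well_formed_pan` and `names_consistent` are hypothetical helpers; the name comparison (case-insensitive, order-insensitive token match) is only an assumption about what a cross-field consistency check might do.

```python
import re

PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def is_well_formed_pan(value: str) -> bool:
    """Structural check only; it says nothing about whether the PAN is active."""
    return PAN_RE.fullmatch(value.strip()) is not None


def names_consistent(pan_name: str, aadhaar_name: str) -> bool:
    """Assumed rule: same set of name tokens, ignoring case, order and spacing."""
    def tokens(name: str) -> frozenset:
        return frozenset(name.casefold().split())
    return tokens(pan_name) == tokens(aadhaar_name)


ok = is_well_formed_pan("ABCDE1234F")       # dummy value in the right shape
bad = is_well_formed_pan("ABCD1234EF")      # only four leading letters
same = names_consistent("Asha  KUMARI", "Kumari Asha")
```

A structural pass is a pre-filter; status (valid, inactive, cancelled) still needs the verification cross-check.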
vs state-format sprawl
EPIC number with state-specific prefixes, voter name, father/spouse name, address, age, gender and constituency — with address normalisation against the Indian district + pincode normaliser.
vs one regex per state
DL number, name, DOB, vehicle classes (MCWG / LMV / HMV / …), issue + expiry dates with an expired-DL flag, and address normalisation across all 36 state and UT formats. ISO 18013 MRZ read where post-2024 DLs carry one.
vs hand-keyed MRZ
Two-line MRZ on the data page; check-digit validation on document number, DOB, expiry and the composite. Country code, given names, surname, sex and expiry extracted. Visa-page extraction for non-Indian passports.
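The MRZ check digits follow the ICAO Doc 9303 scheme: repeating weights 7, 3, 1 over character values, modulo 10. A minimal sketch, with hypothetical function names and the ICAO specimen values used as examples:

```python
from itertools import cycle


def mrz_char_value(ch: str) -> int:
    """Digits map to themselves, A-Z to 10-35, and the filler '<' to 0."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"character not allowed in an MRZ: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, reduced modulo 10."""
    return sum(mrz_char_value(ch) * w for ch, w in zip(field, cycle((7, 3, 1)))) % 10


def mrz_field_ok(field: str, check_char: str) -> bool:
    """Compare a field against the check character printed next to it."""
    return len(check_char) == 1 and "0" <= check_char <= "9" \
        and mrz_check_digit(field) == int(check_char)


doc_number_check = mrz_check_digit("L898902C3")  # specimen document number
dob_check = mrz_check_digit("740812")            # specimen date of birth (YYMMDD)
expiry_check = mrz_check_digit("120415")         # specimen expiry date (YYMMDD)
```

The composite check digit uses the same function over the concatenated fields and their check digits, as laid out in the specification.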
vs GSTIN spoofing in lending
15-character GSTIN validation per CGST §25(7) + Rule 10, optional GSTN-API cross-check (active / cancelled / suspended), line-item extraction with HSN/SAC against the CBIC list, and a CGST + SGST + IGST + cess computation cross-check that flags total mismatches.
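A sketch of two of the checks named above: a structural GSTIN test and the tax-total cross-check. The regular expression encodes the commonly described layout (state code, PAN, entity code, "Z", check character) and is an assumption here, as are the function names and the one-paisa tolerance.

```python
import re
from decimal import Decimal

# Assumed layout: 2-digit state code + 10-char PAN + entity code + 'Z' + check char.
GSTIN_SHAPE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")


def gstin_shape_ok(gstin: str) -> bool:
    """Structural check only; registration status needs the GSTN cross-check."""
    return GSTIN_SHAPE.fullmatch(gstin) is not None


def total_mismatch(taxable: Decimal, cgst: Decimal, sgst: Decimal, igst: Decimal,
                   cess: Decimal, stated_total: Decimal,
                   tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when the stated invoice total disagrees with the recomputed sum."""
    computed = taxable + cgst + sgst + igst + cess
    return abs(computed - stated_total) > tolerance


# Intra-state example: 9% CGST + 9% SGST on a taxable value of 1,000.00.
clean = total_mismatch(Decimal("1000.00"), Decimal("90.00"), Decimal("90.00"),
                       Decimal("0"), Decimal("0"), Decimal("1180.00"))
flagged = total_mismatch(Decimal("1000.00"), Decimal("90.00"), Decimal("90.00"),
                         Decimal("0"), Decimal("0"), Decimal("1200.00"))
```

`Decimal` is used rather than `float` so that paisa-level comparisons are exact.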
vs per-bank PDF chaos
Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, IndusInd, IDFC First, Federal, Canara and Bank of Baroda. Header, opening/closing balance, IFSC/MICR, and per-transaction date / narration / debit / credit / balance with a balance-reconciliation check. Roadmap (Q3 2026): PNB, Union Bank of India, and Indian Bank.
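The balance-reconciliation check can be sketched as a running-balance walk. `Txn`, `first_unreconciled_row` and `statement_reconciles` are hypothetical names, and exact `Decimal` equality is an assumption; a real parser may need a tolerance for banks that round differently.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Txn:
    debit: Decimal
    credit: Decimal
    balance: Decimal  # balance printed on the statement after this row


def first_unreconciled_row(opening: Decimal, txns: List[Txn]) -> Optional[int]:
    """Index of the first row whose printed balance disagrees with the running sum."""
    running = opening
    for index, txn in enumerate(txns):
        running = running + txn.credit - txn.debit
        if running != txn.balance:
            return index
    return None


def statement_reconciles(opening: Decimal, txns: List[Txn], closing: Decimal) -> bool:
    """Every row must reconcile and the last balance must equal the closing balance."""
    if first_unreconciled_row(opening, txns) is not None:
        return False
    last = txns[-1].balance if txns else opening
    return last == closing


rows = [
    Txn(Decimal("0"), Decimal("5000.00"), Decimal("15000.00")),
    Txn(Decimal("1200.50"), Decimal("0"), Decimal("13799.50")),
]
ok = statement_reconciles(Decimal("10000.00"), rows, Decimal("13799.50"))
```

Returning the first failing row index gives a reviewer the exact line where a misread digit or dropped row is most likely.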
vs MICR re-keying
CTS-2010 MICR-line decode, payee name, amount in figures and in words (cross-checked), signature-presence flag, post-dated detection and account-payee crossing detection.
vs manual income proofs
Form 16 (gross salary, deductions, taxable income, tax paid), Form 26AS (section-wise TDS, deductor TANs, refund status), and ITR-V (ITR number, assessment year, status, refund) — with a Form 16 ↔ 26AS cross-check.
Every extraction returns schema-validated JSON (registry in paperkit/schemas/), a per-field
confidence in [0,1], a PII-safe audit_payload that never contains a raw Aadhaar, account
number or name, and a hash-linked Ed25519 audit-chain entry. Non-standard forms (a KCC form, a corporate
vendor-onboarding form) register through the low-code schema builder — paperkit schema register.
04 — REST · gRPC · webhooks · CLI
X-PaperKit-Api-Key on every request; per-tenant token isolation at every handler. Submit
a document, receive the finished result on an HMAC-signed webhook (X-Paperkit-Signature)
or fetch it by id. High-throughput callers use the gRPC surface; ops and air-gapped deployments use the
CLI. The same masking and audit-chain hand-off runs on all four.
# 1. Confirm the key, then submit a scanned Aadhaar for extraction.
curl -H "X-PaperKit-Api-Key: $PAPERKIT_KEY" \
  https://kyc.yourbank.internal/v1/whoami
# → { "tenant_id": "ten_yourbank_prod", "role": "kyc_operator" }

curl -X POST https://kyc.yourbank.internal/v1/extractions \
  -H "X-PaperKit-Api-Key: $PAPERKIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type_hint": "aadhaar",
    "input_object_ref": "s3://paperkit-ten_yourbank_prod/inbox/cust88421.pdf",
    "input_size_bytes": 248194,
    "input_pages": 1
  }'
# → 202 Accepted { "request_id": "req_…" }
# The masked result arrives on your signed webhook.
# Verify X-Paperkit-Signature (HMAC-SHA256) before trusting the body.
{
  "document_type_resolved": "aadhaar",
  "masked_aadhaar": "XXXX-XXXX-9012",
  "full_aadhaar": null,               # never returned unless KUA/AUA opt-out
  "photograph_stripped": true,        # UIDAI Reg 2021 §7
  "aadhaar_masking_applied": true,
  "confidence_aggregate": 0.97,
  "audit_entry_id": "ace_…"           # Ed25519-signed, hash-linked
}
# Ops + air-gapped: the CLI runs init, extract, and the cert flows.
paperkit init --tenant-id ten_yourbank_prod --tier business \
  --hosting-region yotta_mumbai --dpdp-data-class financial
paperkit extract ./cust88421.pdf
paperkit verify-chain
paperkit dossier 2026-04-01..2026-06-30   # DPDP §17 audit dossier
CLI surface: init, extract, batch, verify-chain,
deletion-cert, dossier, bsa-cert, schema register.
Batch ingest accepts up to 1,000 files with a per-tenant concurrency limit (default 8, up to 64 on
Enterprise). Webhook delivery retries on exponential backoff — 1m, 5m, 15m, 1h, 6h, 24h (first attempt plus six retries) — and each
terminal delivery is audit-chain-anchored.
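On the receiving side, the webhook signature check might look like the sketch below. The header name and HMAC-SHA256 come from the text above; the hex encoding of the signature, the shared-secret provisioning, and the function name `verify_webhook` are assumptions for illustration.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare in constant time.

    Assumes the X-Paperkit-Signature header carries a hex digest.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()
    return hmac.compare_digest(expected.encode(), received.encode())


# Illustrative round trip with a made-up secret and body.
secret = b"example-shared-secret"
body = b'{"document_type_resolved": "aadhaar", "masked_aadhaar": "XXXX-XXXX-9012"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
accepted = verify_webhook(secret, body, signature)
rejected = verify_webhook(secret, body + b" ", signature)
```

Verify against the raw bytes as received, before any JSON parsing or re-serialisation, since re-encoding can change the bytes and invalidate the digest.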
05 — Redaction in the pipeline
Masking is not a post-processing toggle — it is a default in the extractor. The audit-chain payload-hash is computed over the masked form, so even the tamper-evident record can't leak the raw number. Only a UIDAI-empanelled KUA/AUA can flip the opt-out, and that override is itself an audit event.
from paperkit.extractors.aadhaar import mask_aadhaar, find_aadhaar_candidates

# UIDAI-compliant masking: first 8 digits → X, last 4 retained,
# formatted "XXXX-XXXX-NNNN" (Aadhaar Reg 2021 §7).
masked = mask_aadhaar("123456789012")
assert masked == "XXXX-XXXX-9012"

# Detection runs over OCR text; every 12-digit candidate is
# Verhoeff-checked. A failing checksum is flagged, never silently kept.
for hit in find_aadhaar_candidates(ocr_text):
    print(hit)  # candidates surfaced for masking, not retention

# What lands in the audit chain (PII-safe):
#   audit_payload sees the HASH + masked form only,
#   never full_aadhaar, account numbers, or names.
The same discipline runs across the catalogue: bank-statement account numbers are stored masked, the cheque MICR is decoded but the audit payload stays PII-safe, and every consent capture is an Ed25519-signed artefact under DPDP §6 that is revocable through the data-principal rights API.
06 — DPDP, RBI & the audit chain
PaperKit's centre of gravity is provenance. Every extraction is Ed25519-signed and hash-linked into a per-tenant chain; the chain root anchors monthly to a public certificate-transparency log so PaperKit itself cannot retroactively edit a past entry. Tampering with any past entry breaks verification. Your internal audit team verifies it without ever contacting us.
vs a database row update with no proof
Right-to-erasure issues a signed PDF certificate citing every deleted extraction ID + hash, the request reference, and the two approver IDs. The delete is irreversible (primary + replicas + object store + cache); the certificate is itself audit-chain-anchored. An empty result still issues a valid certificate — proof you checked.
vs PDF scans + Finacle log entries
A single date-range query produces a signed dossier listing every relevant audit entry, the result of verifying the whole chain, and the chain's pinned state at generation time. Hand it to the Data Protection Board, an RBI inspector, or internal audit.
vs inadmissible paper trails
Any extraction exports as a four-section BSA §63 evidence bundle, counter-signed by the tenant root key and the operator's key, with a parallel Indian Evidence Act §65B(4) section for the transition window. The bundle is tamper-evident and engineered to satisfy the technical requirements (hash, algorithm, chain of custody, device identity, operator). Admissibility in any proceeding is determined by the court.
vs trust-us audit logs
Per-tenant Ed25519 root key (HSM-backed in cloud; BYOK on Enterprise). Each entry hash-links to the previous; the root anchors monthly to a public CT log. paperkit verify-chain walks every signature — green "OK", or a "TAMPER DETECTED" with the offending entry id.
vs FY25 "KYC audit-trail gaps" findings
Every OVD type (Aadhaar, passport, DL, Voter ID) extractable with an audit anchor (Para 16); per-tenant retention enforcement (Para 38); tamper-evident reproduction for inspection (Para 39); a re-KYC trigger interface (Para 45). Aligned with PMLA 2002 + PMLA Rules 2005.
vs standalone OCR with no source-of-truth
Seven first-party adapters: DigiLocker source-attested fetch, UIDAI offline-XML parse, NSDL PAN verify, GSTN GSTIN verify, Sahamati Account Aggregator, and e-Sign via eMudhra and via NSDL (one adapter each). Each is a Protocol + StubAdapter + ProductionAdapter, and each call is audit-chain-anchored.
Two-operator approval gates the irreversible and the sensitive: DPDP §13 deletion-cert issuance, Ed25519 key rotation, cross-tenant schema sharing, operator role elevation, and white-label provisioning. Both operators must affirm within a tenant-configurable approval window (default 60 minutes) or the request times out; either rejection cancels; the approval record is in the chain. Certifications held: ISO 9001:2015, ISO 27001 (certificate available on request). In progress (targeted): CERT-In empanelment, STQC, SOC 2 Type II. Supply-chain clean-room option: excludes Baidu/PaddleOCR dependencies per tenant flag, for customers with Baidu supply-chain restrictions.
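The two-operator gate described above can be modelled as a small state machine. This is a sketch under assumptions: `ApprovalRequest`, `Status` and `decide` are hypothetical names, and raising `ValueError` for a repeated operator id is one possible way to reject matching approvers.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMED_OUT = "timed_out"


@dataclass
class ApprovalRequest:
    action: str
    opened_at: datetime
    window: timedelta = timedelta(minutes=60)  # default approval window
    approvers: List[str] = field(default_factory=list)
    status: Status = Status.PENDING

    def decide(self, operator_id: str, approve: bool, now: datetime) -> Status:
        """Record one operator's decision and return the resulting status."""
        if self.status is not Status.PENDING:
            return self.status                      # terminal states never change
        if now - self.opened_at > self.window:
            self.status = Status.TIMED_OUT          # window elapsed before two approvals
        elif not approve:
            self.status = Status.REJECTED           # either rejection cancels
        elif operator_id in self.approvers:
            raise ValueError("second approver must be a different operator")
        else:
            self.approvers.append(operator_id)
            if len(self.approvers) == 2:
                self.status = Status.APPROVED
        return self.status


t0 = datetime(2026, 4, 1, 10, 0)
request = ApprovalRequest("deletion-cert", opened_at=t0)
request.decide("op_a", True, t0 + timedelta(minutes=5))
final = request.decide("op_b", True, t0 + timedelta(minutes=20))
```

In a chain-backed system each call to `decide` would also be written as an audit entry, so the approval record is itself verifiable.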
07 — Operator console
The console is server-rendered FastAPI — no SPA build, no Node toolchain, runs anywhere Python runs, on the bank's network. Role-based access (Extractor, KYC Operator, Audit Reviewer, DPO, Tenant Admin, Cloud Admin, Read-Only Auditor) means a person never sees a button their role can't use. Designed to meet WCAG 2.2 AA (formal audit scheduled Q3 2026); English + Hindi at GA. Roadmap (GA+6 months): eight additional Indian languages.
KPI tiles, recent extractions, and the open-exception count — everything a KYC ops head needs in the first ten seconds after login.
Filter by document type; click into a result to see typed fields, per-field confidence bars, the masked Aadhaar, and the linked audit-chain entry. Dedicated bank-statement view for reconciled statements.
My queue / Unassigned / Team queue. Source crop beside the read value with a correction box; one-click Confirm / Save correction / Reject-with-reason. Every action logged in the chain.
Run verify-chain on a schedule (green "OK" / red "TAMPER DETECTED"), generate a DPDP §17 dossier over a date range, and produce a BSA §63 evidence certificate for a single extraction.
Enter the data-principal reference, review the held records, first + a different second approver affirm, then Issue certificate. The system rejects matching operator ids.
Browse PaperKit-shipped schemas and tenant custom schemas, register a low-code YAML/JSON-Schema template, and pin versions — every schema change audit-chain-anchored.
08 — Deployment
Every feature on SaaS is available self-hosted: same audit chain, same schema set, same DPDP-deletion flow. That parity is the line no incumbent crosses: Karza-Perfios, Signzy, Hyperverge, IDfy, AuthBridge, Textract, Document AI, Form Recognizer and Onfido are all SaaS-only or SaaS-primary.
SaaS · India-resident
Self-hosted · air-gap-capable
An optional 30-second health beacon (opt-in for self-hosted) reports deployment id, software version, throughput, error rate and resource utilisation, with PagerDuty-compatible alert routing. Multi-tenant isolation runs to the metal: per-tenant Postgres schema, object-store bucket, encryption key, Ed25519 root key and worker queue — no admin endpoint returns cross-tenant data.
09 — How to engage
Pilots are short and structured — bank pilots run 90–180 days, fintech 30–60 — on your own document corpus with the full audit chain, DPDP §13 deletion-certificate, and India-deep schema set included. SaaS (India-resident) and self-hosted (air-gapped) options are both available. Reach out and we will scope the right engagement for your volume and compliance requirements.
10 — Start
A 90-day pilot on your own document corpus — the audit chain, the DPDP §13 deletion-certificate, and the India-deep schema set the incumbents don't ship. Self-hosted or India-resident SaaS.