PaperKit India KYC + document AI · audit-chain native

PaperKit · India-first KYC + document AI

Every extraction, signed.
Every deletion, certified.
Every audit, ready.

An India-first KYC and document-AI extraction platform with an Ed25519 audit chain on every extraction: Aadhaar (UIDAI-masked), PAN, Voter ID, driving licence across 36 RTO layouts, passport MRZ, GST invoice with HSN line items, and bank-statement aggregation across 11 Indian banks. The incumbents (Karza-Perfios, Signzy, Hyperverge, IDfy, AuthBridge) ship a database-row audit trail; the foreign cloud OCR APIs (Textract, Document AI, Form Recognizer, Onfido) can't host inside India. PaperKit ships tamper-evident provenance, DPDP §13 deletion certificates, and feature parity between self-hosted and SaaS, with 2,601 software tests passing (pipeline and integration tests, not a measure of OCR or extraction accuracy).

2,601
Software tests
91 test files · pipeline/integration · all green
9
Document extractors
11 schemas · Aadhaar · PAN · DL · passport · GST · …
11
Bank-statement parsers
Private-sector-first · PNB, Union Bank & Indian Bank Q3 2026
36
RTO DL layouts
Every state + UT licence format
Ed25519
Audit chain
Hash-linked · monthly CT-log anchor
§13 / §17
DPDP 2023
Deletion-cert + audit-dossier built in
India
Data residency
SaaS or air-gapped self-hosted

01 — Who it's for

Built for the KYC desk that has to answer the regulator.

Banks, NBFCs, fintechs, brokers and insurtechs all process the same stack (Aadhaar, PAN, a bank statement, a GST invoice) and all carry the same 2026 gap: no cryptographic provenance on the verification step. The RBI's March 2024 revised Master Direction on KYC (RBI/2024-25/118) cited recurring KYC audit-trail deficiencies across its inspection cohort. PaperKit is the audit-chain-native extraction layer that closes that gap.

— ICP · 01 · Universal banks

KYC Operations Head

800–1,100 onboardings/day across a branch cluster, 18–32 minutes per file, a 3–8% first-pass field-entry error rate, and inconsistent Aadhaar masking driving compliance findings. Needs sub-8-minute onboarding and an RBI-inspection-ready artefact.

— ICP · 02 · Fintechs & brokers

KYC Ops Lead

4,000–18,000 onboardings/day on a stack built fast for growth. Per-doc OCR fees from the incumbents compress margin; the audit trail is a Postgres row. RBI Payment Aggregator and SEBI audit expectations are tightening.

— ICP · 03 · NBFCs

Credit Operations

Commercial-vehicle and equipment financing books growing 18–22%/yr, gated on KYC + GST invoice cross-verification. A single fraudulent GSTIN match drives a costly recovery case. Needs GSTIN validation + GSTN cross-check + HSN at the loan step.

— ICP · 04 · PSU banks · DPDP-strict

CISO / DPO

Published RFI scoping for self-hosted KYC AI; foreign-cloud residency conflicts with DPDP-aligned RBI cyber-security circulars. Needs air-gapped operation with the same audit-chain, schema set and DPDP-deletion flow as the SaaS tier.

02 — The pipeline

Ingest → classify → extract → mask → sign.

A document lands at the API gateway, gets tenant-scoped and rate-limited, and runs the same six-stage pipeline whether you call it over REST, gRPC, or in-process. Each external dependency — OCR engine, VLM, verification API — sits behind a Protocol + StubAdapter + ProductionAdapter, so tests run with zero credentials and production slots fill at deploy time.
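The adapter pattern described above can be sketched as follows. This is a minimal illustration, not PaperKit's code: `OcrEngine`, `OcrResult`, `StubOcrAdapter`, `ProductionOcrAdapter` and `run_ocr` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OcrResult:
    text: str
    confidence: float  # in [0, 1]


class OcrEngine(Protocol):
    """Structural interface every OCR adapter satisfies (hypothetical)."""

    def read(self, page_bytes: bytes) -> OcrResult: ...


class StubOcrAdapter:
    """Deterministic stand-in: no network, no credentials."""

    def __init__(self, canned_text: str = "STUB TEXT", confidence: float = 0.99):
        self._result = OcrResult(canned_text, confidence)

    def read(self, page_bytes: bytes) -> OcrResult:
        return self._result


class ProductionOcrAdapter:
    """Placeholder for a real engine; credentials are supplied at deploy time."""

    def __init__(self, api_key: str):
        self._api_key = api_key

    def read(self, page_bytes: bytes) -> OcrResult:
        raise NotImplementedError("call the real OCR service here")


def run_ocr(engine: OcrEngine, page_bytes: bytes) -> OcrResult:
    # The pipeline depends only on the Protocol, never on a concrete adapter.
    return engine.read(page_bytes)


result = run_ocr(StubOcrAdapter("ABCDE1234F"), b"%PDF-...")
print(result.text, result.confidence)
```

Because the pipeline function only sees the Protocol, a test suite can pass the stub everywhere and never needs a key.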

Step · 01

For mixed scans, photos, PDFs

Ingest & pre-process.

PDF page-split, image normalisation, orientation correction. PDF, JPG, PNG, TIFF, HEIC accepted; multi-page documents handled with per-page resolution.

paperkit.pipeline.preprocess
Step · 02

For government-tender and RBI-inspector explainability requirements

Classify — per page.

A signature-based DocumentClassifier resolves each page to a document type (aadhaar, pan, gst_invoice, …) with an explainable "why" — the audit answer to "why did you classify this as PAN?".

paperkit.pipeline.document_classifier
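One way to picture a signature-based classifier with an explainable "why" is the sketch below. The phrase lists, the `Classification` type and `classify_page` are assumptions made for illustration; the shipped classifier's signatures are not described in this page.

```python
from dataclasses import dataclass

# Hypothetical keyword signatures; a real classifier would use richer features.
SIGNATURES = {
    "aadhaar": ("unique identification authority", "aadhaar"),
    "pan": ("income tax department", "permanent account number"),
    "gst_invoice": ("gstin", "tax invoice", "hsn"),
}


@dataclass(frozen=True)
class Classification:
    doc_type: str
    matched: tuple  # signature phrases found on the page
    why: str        # human-readable explanation for the audit record


def classify_page(ocr_text: str) -> Classification:
    text = ocr_text.lower()
    best_type, best_hits = "unknown", ()
    for doc_type, phrases in SIGNATURES.items():
        hits = tuple(p for p in phrases if p in text)
        if len(hits) > len(best_hits):
            best_type, best_hits = doc_type, hits
    if not best_hits:
        return Classification("unknown", (), "no known signature phrase found")
    why = f"matched {len(best_hits)} signature phrase(s): " + ", ".join(best_hits)
    return Classification(best_type, best_hits, why)


page = "INCOME TAX DEPARTMENT  GOVT. OF INDIA  Permanent Account Number Card"
result = classify_page(page)
print(result.doc_type, "|", result.why)
```

The `why` string is what an inspector would read: it names the exact phrases that drove the decision.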
Step · 03

For noisy Indic scans

OCR — Mistral, then fallback.

Mistral OCR 3 by default (Tesseract for air-gapped). Below the per-tenant confidence floor (default 0.85), escalate to a VLM — Claude, Gemini, OpenAI, or local PaliGemma / Qwen — per your routing rules.

paperkit.ocr.routing
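The escalation rule can be sketched as below: run the primary engine, escalate to fallbacks only while the read is under the floor, and flag the result for human review if nothing clears it. Engines are modelled as plain callables, and `Read`, `route` and the stub engines are hypothetical names, not the `paperkit.ocr.routing` API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Read:
    engine: str
    text: str
    confidence: float


# Each engine is modelled as a callable taking page bytes and returning a Read.
Engine = Callable[[bytes], Read]


def route(page: bytes, primary: Engine, fallbacks: Sequence[Engine],
          floor: float = 0.85) -> tuple:
    """Return (best_read, needs_review); escalate only while below the floor."""
    best = primary(page)
    for engine in fallbacks:
        if best.confidence >= floor:
            break
        candidate = engine(page)
        if candidate.confidence > best.confidence:
            best = candidate
    return best, best.confidence < floor


# Stub engines with fixed confidences, for illustration only.
def primary_ocr(page: bytes) -> Read:
    return Read("primary_ocr", "ABCDE1234F", 0.62)


def vlm_fallback(page: bytes) -> Read:
    return Read("vlm_fallback", "ABCDE1234F", 0.93)


read, needs_review = route(b"...", primary_ocr, [vlm_fallback])
print(read.engine, read.confidence, needs_review)
```

With no fallback configured, the same call returns the 0.62 read with `needs_review` set, which is the case that would land in a review queue.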
Step · 04

For UIDAI Reg 2021 §7

Extract & mask.

Per-document extractors emit schema-validated JSON with per-field confidence. Aadhaar is masked to XXXX-XXXX-NNNN with the Verhoeff checksum validated and the embedded photograph stripped — by default, every time.

paperkit.extractors.aadhaar.mask_aadhaar
Step · 05

For independent verification

Sign into the audit chain.

Every extraction writes an AuditChainEntry(prev_hash, payload_hash, ts, signer_key_fingerprint, signature) — Ed25519-signed by the per-tenant root key. The payload-hash sees only the masked form, never the raw Aadhaar.

paperkit.audit_chain.chain · signer
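A hash-linked chain with the five fields named above can be sketched with the standard library alone. Two loud assumptions: HMAC-SHA256 stands in for the Ed25519 signature so the sketch runs without third-party packages, and the serialisation, `append` and `verify` helpers are illustrative rather than PaperKit's implementation.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass

GENESIS = "0" * 64


@dataclass(frozen=True)
class AuditChainEntry:
    prev_hash: str
    payload_hash: str
    ts: str
    signer_key_fingerprint: str
    signature: str


def _body(prev_hash, payload_hash, ts, fingerprint) -> bytes:
    # Canonical bytes that get signed.
    return json.dumps([prev_hash, payload_hash, ts, fingerprint]).encode()


def entry_hash(e: AuditChainEntry) -> str:
    body = _body(e.prev_hash, e.payload_hash, e.ts, e.signer_key_fingerprint)
    return hashlib.sha256(body + e.signature.encode()).hexdigest()


def append(chain: list, masked_payload: dict, ts: str, key: bytes) -> None:
    prev = entry_hash(chain[-1]) if chain else GENESIS
    payload_hash = hashlib.sha256(
        json.dumps(masked_payload, sort_keys=True).encode()).hexdigest()
    fp = hashlib.sha256(key).hexdigest()[:16]
    # ASSUMPTION: HMAC-SHA256 stands in for the Ed25519 signature.
    sig = hmac.new(key, _body(prev, payload_hash, ts, fp), hashlib.sha256).hexdigest()
    chain.append(AuditChainEntry(prev, payload_hash, ts, fp, sig))


def verify(chain: list, key: bytes):
    """Return the index of the first bad entry, or None if the chain is intact."""
    prev = GENESIS
    for i, e in enumerate(chain):
        body = _body(e.prev_hash, e.payload_hash, e.ts, e.signer_key_fingerprint)
        expected = hmac.new(key, body, hashlib.sha256).hexdigest()
        if e.prev_hash != prev or not hmac.compare_digest(e.signature, expected):
            return i
        prev = entry_hash(e)
    return None


key = b"demo-tenant-key"
chain = []
append(chain, {"aadhaar_masked": "XXXX-XXXX-9012"}, "2026-01-01T00:00:00Z", key)
append(chain, {"doc_type": "pan", "status": "ok"}, "2026-01-01T00:01:00Z", key)
print(verify(chain, key))  # None means every link and signature checked out
```

Only the masked payload is hashed, so the record never needs the raw number. Editing any field of an earlier entry changes either its signature check or the next entry's `prev_hash` check, and `verify` reports the first index that fails.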
Step · 06

For low-confidence reads

Triage, don't guess.

Below the floor, the extraction is not delivered as final — a task lands in the human-in-loop Review queue. Every confirm / correct / reject is logged as an OperatorCorrection in the chain.

paperkit.workflows.exception_triage

03 — Document-type catalogue

Nine extractors, eleven schemas. Each its own module.

India isn't a generic OCR region. Aadhaar has a Verhoeff checksum and a masking mandate; PAN has the IT Act Rule 114 format; the driving licence has 36 RTO layouts; the GST invoice has GSTIN, HSN/SAC and a three-tax split. Every extractor below is a real module with a schema in paperkit/schemas/ and a per-field confidence score.

01 · Aadhaar

vs manual, inconsistent masking

UIDAI-masked — Verhoeff-validated.

12-digit detection + Verhoeff checksum; first 8 digits masked to XXXX-XXXX-, last 4 retained; embedded photograph stripped per Reg 2021 §7. A structured KUA/AUA opt-out retains the full number in encrypted form, with the policy override audit-logged.

paperkit.extractors.aadhaar
02 · PAN

vs free-text key-in error

IT Act Rule 114 format.

10-character extraction enforced against [A-Z]{5}[0-9]{4}[A-Z], optional NSDL PAN-verify cross-check for status (valid / inactive / cancelled), and a name cross-field consistency check against the Aadhaar in the same batch.

paperkit.extractors.pan
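The format rule quoted above is small enough to show directly. The regular expression is the one stated in the card; `is_pan_format` and the token-set name comparison in `names_consistent` are hypothetical illustrations, since the actual cross-field rule is not described here.

```python
import re

PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def is_pan_format(value: str) -> bool:
    # fullmatch rejects anything with extra leading or trailing characters.
    return PAN_RE.fullmatch(value) is not None


def names_consistent(pan_name: str, aadhaar_name: str) -> bool:
    """Hypothetical rule: same name tokens, ignoring case, spacing and order."""
    def tokens(name: str) -> set:
        return set(name.casefold().split())
    return tokens(pan_name) == tokens(aadhaar_name)


print(is_pan_format("ABCDE1234F"), is_pan_format("ABCDE12345"))
print(names_consistent("Asha  Rao", "RAO ASHA"))
```

A format pass only says the string is well-formed; status (valid, inactive, cancelled) still needs the optional verification call.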
03 · Voter ID (EPIC)

vs state-format sprawl

10-character EPIC + address.

EPIC number with state-specific prefixes, voter name, father/spouse name, address, age, gender and constituency — with address normalisation against the Indian district + pincode normaliser.

paperkit.extractors.voter_id
04 · Driving licence

vs one regex per state

36 RTO layouts — expiry flagged.

DL number, name, DOB, vehicle classes (MCWG / LMV / HMV / …), issue + expiry dates with an expired-DL flag, and address normalisation across all 36 state and UT formats. ISO 18013 MRZ read where post-2024 DLs carry one.

paperkit.extractors.dl
05 · Passport MRZ

vs hand-keyed MRZ

ICAO 9303 Type 3 — check digits.

Two-line MRZ on the data page; check-digit validation on document number, DOB, expiry and the composite. Country code, given names, surname, sex and expiry extracted. Visa-page extraction for non-Indian passports.

paperkit.extractors.passport_mrz
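The check-digit validation mentioned above follows the ICAO 9303 weighting of 7, 3, 1 over character values (digits as themselves, A to Z as 10 to 35, the filler `<` as 0). The sketch below validates the second line of a Type 3 MRZ; the function names are illustrative, and the sample line is the Utopia specimen commonly used in ICAO 9303 examples.

```python
def mrz_char_value(ch: str) -> int:
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    weights = (7, 3, 1)
    return sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field)) % 10


def validate_td3_line2(line: str) -> dict:
    """Map each checked field to True/False for a 44-character TD3 second line."""
    if len(line) != 44:
        raise ValueError("TD3 MRZ lines are 44 characters")
    checks = {
        "document_number": (line[0:9], line[9]),
        "date_of_birth": (line[13:19], line[19]),
        "date_of_expiry": (line[21:27], line[27]),
        # Composite covers the upper fields plus their check digits.
        "composite": (line[0:10] + line[13:20] + line[21:43], line[43]),
    }
    return {name: str(mrz_check_digit(field)) == digit
            for name, (field, digit) in checks.items()}


specimen = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
print(validate_td3_line2(specimen))
```

A single misread character in the document number fails both its own check digit and the composite, which helps localise the OCR error.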
06 · GST invoice

vs GSTIN spoofing in lending

GSTIN + HSN line items.

15-character GSTIN validation per CGST §25(7) + Rule 10, optional GSTN-API cross-check (active / cancelled / suspended), line-item extraction with HSN/SAC against the CBIC list, and a CGST + SGST + IGST + cess computation cross-check that flags total mismatches.

paperkit.extractors.gst_invoice
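The tax computation cross-check can be sketched as a sum over line items compared against the declared invoice total. `LineItem`, `cross_check_total` and the 0.50 rounding tolerance are assumptions for illustration; the HSN codes and amounts are made up.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    hsn: str
    taxable_value: Decimal
    cgst: Decimal = Decimal("0")
    sgst: Decimal = Decimal("0")
    igst: Decimal = Decimal("0")
    cess: Decimal = Decimal("0")


def cross_check_total(items, declared_total: Decimal,
                      tolerance: Decimal = Decimal("0.50")) -> dict:
    """Recompute the invoice total and flag a mismatch beyond the tolerance."""
    computed = sum(
        (i.taxable_value + i.cgst + i.sgst + i.igst + i.cess for i in items),
        Decimal("0"),
    )
    diff = declared_total - computed
    return {"computed_total": computed, "difference": diff,
            "mismatch": abs(diff) > tolerance}


items = [
    LineItem("8471", Decimal("50000.00"), cgst=Decimal("4500.00"), sgst=Decimal("4500.00")),
    LineItem("8523", Decimal("2000.00"), cgst=Decimal("180.00"), sgst=Decimal("180.00")),
]
print(cross_check_total(items, Decimal("61360.00"))["mismatch"])
print(cross_check_total(items, Decimal("63360.00"))["mismatch"])
```

`Decimal` is used instead of floats so that paise-level sums compare exactly.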
07 · Bank statement

vs per-bank PDF chaos

11 banks — reconciled.

Dedicated parsers for HDFC, ICICI, SBI, Axis, Kotak, Yes, IndusInd, IDFC First, Federal, Canara and Bank of Baroda. Header, opening/closing balance, IFSC/MICR, and per-transaction date / narration / debit / credit / balance with a balance-reconciliation check. Roadmap (Q3 2026): PNB, Union Bank of India, and Indian Bank.

paperkit.bank_parsers · hdfc · icici · sbi · …
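A balance-reconciliation check of the kind described above can be sketched as a running-balance walk from the opening balance to the closing balance. `Txn` and `reconcile` are hypothetical names, and the figures are invented.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Txn:
    date: str
    narration: str
    debit: Decimal
    credit: Decimal
    balance: Decimal  # running balance printed on the statement


def reconcile(opening: Decimal, closing: Decimal, txns) -> list:
    """Return human-readable discrepancies; an empty list means reconciled."""
    problems = []
    running = opening
    for n, t in enumerate(txns, start=1):
        running = running - t.debit + t.credit
        if running != t.balance:
            problems.append(
                f"row {n}: expected balance {running}, statement shows {t.balance}")
            running = t.balance  # resync so one bad row is reported once
    if running != closing:
        problems.append(f"closing: expected {running}, statement shows {closing}")
    return problems


zero = Decimal("0")
txns = [
    Txn("2026-01-02", "UPI/groceries", Decimal("2500.00"), zero, Decimal("7500.00")),
    Txn("2026-01-05", "NEFT/salary", zero, Decimal("40000.00"), Decimal("47500.00")),
]
print(reconcile(Decimal("10000.00"), Decimal("47500.00"), txns))
```

A row whose printed balance does not follow from the previous row is either a misread digit or an edited statement; both are worth surfacing before the figures feed a credit decision.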
08 · Cheque

vs MICR re-keying

MICR + payee + amount.

CTS-2010 MICR-line decode, payee name, amount in figures and in words (cross-checked), signature-presence flag, post-dated detection and account-payee crossing detection.

paperkit.extractors.cheque
09 · Form 16 / 26AS / ITR-V

vs manual income proofs

Income verification.

Form 16 (gross salary, deductions, taxable income, tax paid), Form 26AS (section-wise TDS, deductor TANs, refund status), and ITR-V (ITR number, assessment year, status, refund) — with a Form 16 ↔ 26AS cross-check.

paperkit.extractors.form_16 · form_26as · itr_v

Every extraction returns schema-validated JSON (registry in paperkit/schemas/), a per-field confidence in [0,1], a PII-safe audit_payload that never contains a raw Aadhaar, account number or name, and a hash-linked Ed25519 audit-chain entry. Non-standard forms (a KCC form, a corporate vendor-onboarding form) register through the low-code schema builder — paperkit schema register.

04 — REST · gRPC · webhooks · CLI

Submit a document. Get signed JSON back.

X-PaperKit-Api-Key on every request; per-tenant token isolation at every handler. Submit a document, receive the finished result on an HMAC-signed webhook (X-Paperkit-Signature) or fetch it by id. High-throughput callers use the gRPC surface; ops and air-gapped deployments use the CLI. The same masking and audit-chain hand-off runs on all four.

POST /v1/extractions · GET /v1/whoami · GET /health · GET /health/live · webhook: X-Paperkit-Signature · gRPC
# 1. Confirm the key resolves to the expected tenant and role.
curl -H "X-PaperKit-Api-Key: $PAPERKIT_KEY" \
     https://kyc.yourbank.internal/v1/whoami
# → { "tenant_id": "ten_yourbank_prod", "role": "kyc_operator" }

# 2. Submit a scanned Aadhaar for extraction.
curl -X POST https://kyc.yourbank.internal/v1/extractions \
  -H "X-PaperKit-Api-Key: $PAPERKIT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "document_type_hint": "aadhaar",
    "input_object_ref": "s3://paperkit-ten_yourbank_prod/inbox/cust88421.pdf",
    "input_size_bytes": 248194,
    "input_pages": 1
  }'
# → 202 Accepted  { "request_id": "req_…" }

CLI surface: init, extract, batch, verify-chain, deletion-cert, dossier, bsa-cert, schema register. Batch ingest accepts up to 1,000 files with a per-tenant concurrency limit (default 8, up to 64 on Enterprise). Webhook delivery retries on an escalating backoff schedule of 1m, 5m, 15m, 1h, 6h and 24h (first attempt plus six retries), and each terminal delivery is audit-chain-anchored.
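On the receiving side, an HMAC-signed webhook is typically verified by recomputing the MAC over the raw request body and comparing it in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded value in X-Paperkit-Signature; the page states only that the webhook is HMAC-signed, so the algorithm, the encoding and both function names are assumptions.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    # ASSUMPTION: SHA-256 digest, hex-encoded.
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    expected = sign_body(secret, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


secret = b"whsec_demo"  # illustrative shared secret
body = b'{"request_id": "req_demo", "status": "completed"}'
header = sign_body(secret, body)

print(verify_webhook(secret, body, header))
print(verify_webhook(secret, body + b" ", header))
```

Verification should run over the exact bytes received, before any JSON parsing, because re-serialising the payload can change whitespace and break the MAC.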

05 — Redaction in the pipeline

The clerk never sees a full Aadhaar.

Masking is not a post-processing toggle — it is a default in the extractor. The audit-chain payload-hash is computed over the masked form, so even the tamper-evident record can't leak the raw number. Only a UIDAI-empanelled KUA/AUA can flip the opt-out, and that override is itself an audit event.

from paperkit.extractors.aadhaar import mask_aadhaar, find_aadhaar_candidates

# UIDAI-compliant masking: first 8 digits → X, last 4 retained,
# formatted "XXXX-XXXX-NNNN" (Aadhaar Reg 2021 §7).
masked = mask_aadhaar("123456789012")
assert masked == "XXXX-XXXX-9012"

# Detection runs over OCR text; every 12-digit candidate is
# Verhoeff-checked. A failing checksum is flagged, never silently kept.
ocr_text = "Aadhaar No: 123456789012"   # illustrative OCR output, not a real number
for hit in find_aadhaar_candidates(ocr_text):
    print(hit)        # candidates surfaced for masking, not retention

# What lands in the audit chain (PII-safe):
#   audit_payload sees the HASH + masked form only —
#   never full_aadhaar, account numbers, or names.

The same discipline runs across the catalogue: bank-statement account numbers are stored masked, the cheque MICR is decoded but the audit payload stays PII-safe, and every consent capture is an Ed25519-signed artefact under DPDP §6 that is revocable through the data-principal rights API.

06 — DPDP, RBI & the audit chain

The wedge that requires re-platforming to replicate.

PaperKit's centre of gravity is provenance. Every extraction is Ed25519-signed and hash-linked into a per-tenant chain; the chain root anchors monthly to a public certificate-transparency log so PaperKit itself cannot retroactively edit a past entry. Tampering with any past entry breaks verification. Your internal audit team verifies it without ever contacting us.

DPDP 2023 · §13

vs a database row update with no proof

Deletion certificate — two-operator gated.

Right-to-erasure issues a signed PDF certificate citing every deleted extraction ID + hash, the request reference, and the two approver IDs. The delete is irreversible (primary + replicas + object store + cache); the certificate is itself audit-chain-anchored. An empty result still issues a valid certificate — proof you checked.

paperkit.audit_chain.deletion_cert
DPDP 2023 · §17

vs PDF scans + Finacle log entries

Audit dossier — chain-verified.

A single-day query produces a signed dossier listing every relevant audit entry, the result of verifying the whole chain, and the chain's pinned state at generation time. Hand it to the Data Protection Board, an RBI inspector, or internal audit.

paperkit.audit_chain.dossier
BSA 2023 · §63

vs inadmissible paper trails

Tamper-evident evidence bundle.

Any extraction exports as a four-section BSA §63 evidence bundle, counter-signed by the tenant root key and the operator's key, with a dual IT Act §65B(4) section for the transition window. The bundle is tamper-evident and engineered to satisfy the technical requirements (hash, algorithm, chain of custody, device identity, operator). Admissibility in any proceeding is determined by the court.

paperkit.audit_chain.bsa_cert
Ed25519 · CT log

vs trust-us audit logs

Per-tenant chain, public anchor.

Per-tenant Ed25519 root key (HSM-backed in cloud; BYOK on Enterprise). Each entry hash-links to the previous; the root anchors monthly to a public CT log. paperkit verify-chain walks every signature — green "OK", or a "TAMPER DETECTED" with the offending entry id.

paperkit.audit_chain.chain · ct_log_anchor
RBI · KYC 2016

vs FY25 "KYC audit-trail gaps" findings

Master Direction, answered.

Every OVD type (Aadhaar, passport, DL, Voter ID) extractable with an audit anchor (Para 16); per-tenant retention enforcement (Para 38); tamper-evident reproduction for inspection (Para 39); a re-KYC trigger interface (Para 45). Aligned with PMLA 2002 + PMLA Rules 2005.

RBI/2024-25/118 · 18 Mar 2024
Verification adapters

vs standalone OCR with no source-of-truth

Source-attested where it counts.

Seven first-party adapters: DigiLocker source-attested fetch, UIDAI offline-XML parse, NSDL PAN verify, GSTN GSTIN verify, Sahamati Account Aggregator, and e-Sign via eMudhra and via NSDL (one adapter each). Every adapter is a Protocol + StubAdapter + ProductionAdapter, and every call is audit-chain-anchored.

paperkit.adapters.digilocker · gstn · nsdl_pan · sahamati_aa · …

Two-operator approval gates the irreversible and the sensitive: DPDP §13 deletion-cert issuance, Ed25519 key rotation, cross-tenant schema sharing, operator role elevation, and white-label provisioning. Both operators must affirm within a tenant-configurable approval window (default 60 minutes) or the request times out; either rejection cancels; the approval record is in the chain.

Certifications held: ISO 9001:2015, ISO 27001 (certificate available on request). In progress (targeted): CERT-In empanelment, STQC, SOC 2 Type II.

Supply-chain clean-room option: excludes Baidu/PaddleOCR dependencies per tenant flag, for customers with Baidu supply-chain restrictions.

07 — Operator console

Seven roles. One FastAPI app.

The console is server-rendered FastAPI — no SPA build, no Node toolchain, runs anywhere Python runs, on the bank's network. Role-based access (Extractor, KYC Operator, Audit Reviewer, DPO, Tenant Admin, Cloud Admin, Read-Only Auditor) means a person never sees a button their role can't use. Designed to meet WCAG 2.2 AA (formal audit scheduled Q3 2026); English + Hindi at GA. Roadmap (GA+6 months): eight additional Indian languages.

View · Dashboard

Tenants, recent reads, open exceptions.

KPI tiles, recent extractions, and the open-exception count — everything a KYC ops head needs in the first ten seconds after login.

View · Extractions

List, drill in, see the chain.

Filter by document type; click into a result to see typed fields, per-field confidence bars, the masked Aadhaar, and the linked audit-chain entry. Dedicated bank-statement view for reconciled statements.

View · Review

Exception triage — 30 seconds a field.

My queue / Unassigned / Team queue. Source crop beside the read value with a correction box; one-click Confirm / Save correction / Reject-with-reason. Every action logged in the chain.

View · Audit

Verify, dossier, evidence.

Run verify-chain on a schedule (green "OK" / red "TAMPER DETECTED"), generate a DPDP §17 dossier over a date range, and produce a BSA §63 evidence certificate for a single extraction.

View · Erasure

DPDP §13 — two approvers.

Enter the data-principal reference, review the held records, first + a different second approver affirm, then Issue certificate. The system rejects matching operator ids.
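The two-approver gate can be sketched as a small state machine: two distinct operators must affirm inside the approval window (60 minutes by default, per the two-operator rules described in section 06), any rejection cancels, and a matching operator id is refused. `ApprovalRequest` and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ApprovalRequest:
    action: str
    created_at: datetime
    window: timedelta = timedelta(minutes=60)
    approvers: list = field(default_factory=list)
    status: str = "pending"  # pending | approved | rejected | timed_out

    def _expire_if_late(self, now: datetime) -> None:
        if self.status == "pending" and now - self.created_at > self.window:
            self.status = "timed_out"

    def approve(self, operator_id: str, now: datetime) -> str:
        self._expire_if_late(now)
        if self.status != "pending":
            return self.status
        if operator_id in self.approvers:
            raise ValueError("second approver must be a different operator")
        self.approvers.append(operator_id)
        if len(self.approvers) == 2:
            self.status = "approved"
        return self.status

    def reject(self, operator_id: str, now: datetime) -> str:
        self._expire_if_late(now)
        if self.status == "pending":
            self.status = "rejected"  # a single rejection cancels the request
        return self.status


t0 = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
req = ApprovalRequest("dpdp_deletion_cert", created_at=t0)
req.approve("op_alice", t0 + timedelta(minutes=5))
print(req.approve("op_bob", t0 + timedelta(minutes=20)))
```

In a real deployment each transition would also be written to the audit chain; that step is omitted here to keep the sketch short.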

View · Schemas

Built-ins + custom overlays.

Browse PaperKit-shipped schemas and tenant custom schemas, register a low-code YAML/JSON-Schema template, and pin versions — every schema change audit-chain-anchored.

08 — Deployment

Self-hosted is first-class, not an afterthought.

Every feature on SaaS is available self-hosted: same audit chain, same schema set, same DPDP-deletion flow. That parity is the line no incumbent crosses: Karza-Perfios, Signzy, Hyperverge, IDfy, AuthBridge, Textract, Document AI, Form Recognizer and Onfido are all SaaS-only or SaaS-primary.

SaaS · India-resident

  • Yotta NM1 (Mumbai) — primary region.
  • Yotta DK1 (Greater Noida) — active-active DR; recovery-time objective < 15 min.
  • Sify Hyderabad — Tier-IV secondary.
  • Per-tenant Postgres schema — no shared tables.
  • Per-tenant bucket + AES-256 key — isolated at rest.
  • 99.9% / 99.95% — Business / Enterprise-dedicated SLA.

Self-hosted · air-gap-capable

  • RHEL 9 / Ubuntu 22.04 / Windows Server 2022 — your hardware.
  • FastAPI + Celery + Redis — Python 3.12 runtime.
  • SQLite or Postgres 16 — schema-portable from day one.
  • Tesseract + PaliGemma 2 / Qwen 2.5-VL — local OCR, no external calls.
  • Local HSM or SoftHSM — Ed25519 custody (flagged in the cert).
  • CT-log anchor optional — disable for fully air-gapped; chain stays self-verifiable.

An optional 30-second health beacon (opt-in for self-hosted) reports deployment id, software version, throughput, error rate and resource utilisation, with PagerDuty-compatible alert routing. Multi-tenant isolation runs to the metal: per-tenant Postgres schema, object-store bucket, encryption key, Ed25519 root key and worker queue — no admin endpoint returns cross-tenant data.

09 — How to engage

Get started with PaperKit.

Pilots are short and structured — bank pilots run 90–180 days, fintech 30–60 — on your own document corpus with the full audit chain, DPDP §13 deletion-certificate, and India-deep schema set included. SaaS (India-resident) and self-hosted (air-gapped) options are both available. Reach out and we will scope the right engagement for your volume and compliance requirements.

10 — Start

Close the KYC audit-trail gap.
Book a pilot.

A 90-day pilot on your own document corpus — the audit chain, the DPDP §13 deletion-certificate, and the India-deep schema set the incumbents don't ship. Self-hosted or India-resident SaaS.