System Design: Building an Audit Log System for Compliance and Debugging

Audit logs answer a simple question: who did what, to which resource, from where, and when?

That question matters during security investigations, customer support, compliance audits, data recovery, and debugging. A normal application log is not enough. Application logs are optimized for engineers. Audit logs are product and compliance records.

An audit log system must be durable, queryable, immutable, and careful with sensitive data.

Mental Model

Producers publish events to a log-based event bus, and independent consumer groups read the same stream and write to their own stores.

graph LR
    Producer[Producer Service] -->|Publish Event| Kafka[Kafka / Event Bus]
    Kafka -->|Consume| Consumer1[Consumer Group A]
    Kafka -->|Consume| Consumer2[Consumer Group B]
    Consumer1 --> DB1[(Primary DB)]
    Consumer2 --> Cache[(Redis)]

Requirements

Functional requirements:

record user actions
record system actions
show audit history for a resource
search by actor, action, resource, tenant, and time range
export logs for compliance
retain logs based on policy

Non-functional requirements:

high write availability
append-only behavior
tamper resistance
low-latency search for recent data
cheap storage for old data
PII minimization
tenant isolation

Event Schema

A practical audit event:

{
  "eventId": "evt_01HXYZ",
  "tenantId": "t_123",
  "actor": {
    "type": "USER",
    "id": "u_456",
    "emailHash": "f2a1..."
  },
  "action": "ROLE_ASSIGNED",
  "resource": {
    "type": "USER_ROLE",
    "id": "role_admin"
  },
  "result": "SUCCESS",
  "ipAddress": "203.0.113.10",
  "userAgent": "Mozilla/5.0",
  "requestId": "req_789",
  "occurredAt": "2025-07-24T10:15:30Z",
  "metadata": {
    "targetUserId": "u_999"
  }
}

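Avoid storing raw sensitive data when a stable hash is enough. For example, emailHash may be enough to match events to a known user during an investigation without storing the full email in the audit stream. Email addresses are easy to guess, so prefer a keyed hash (such as HMAC-SHA256) over a bare digest, which can be reversed by hashing candidate addresses.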
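A minimal sketch of how an emailHash value could be produced, assuming a keyed hash (HMAC-SHA256) and a secret key held outside the audit store. The class name EmailHasher and the normalization rule (trim plus lowercase) are illustrative assumptions, not something the schema above prescribes.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Locale;

// Hypothetical helper: derives a stable, keyed hash of an email address.
final class EmailHasher {
    private final byte[] key;

    EmailHasher(byte[] key) {
        this.key = key.clone();
    }

    /** Returns a hex-encoded HMAC-SHA256 of the normalized email. */
    String hash(String email) {
        // Assumption: trimming and lowercasing is an acceptable normalization.
        String normalized = email.trim().toLowerCase(Locale.ROOT);
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] digest = mac.doFinal(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

The same email always maps to the same hash under one key, so investigators can still match events for a known address. Rotating the key breaks that matching, which is a trade-off to decide up front.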

Write Path

There are two common approaches.

Synchronous Write

The API writes audit logs inside the request path:

userRoleService.assignRole(userId, role);
auditLogService.record(RoleAssignedEvent.from(userId, role));

This is simple but risky. If the audit log store is slow, the product action becomes slow. If audit logging fails, do you fail the user request? For compliance-critical actions, maybe yes. For lower-risk actions, maybe no.
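The fail-or-continue decision can be made explicit per action. The sketch below is one way to express it, assuming a hypothetical AuditSink interface and a configured set of fail-closed action names; neither appears in the original code.

```java
import java.util.Set;

// Hypothetical policy wrapper around a synchronous audit write.
final class SyncAuditPolicy {
    /** Hypothetical sink; stands in for whatever writes to the audit store. */
    interface AuditSink {
        void record(String action, String payload) throws Exception;
    }

    private final AuditSink sink;
    private final Set<String> failClosedActions;

    SyncAuditPolicy(AuditSink sink, Set<String> failClosedActions) {
        this.sink = sink;
        this.failClosedActions = failClosedActions;
    }

    /**
     * Returns true if the audit entry was written. For fail-closed actions a
     * write failure is rethrown so the caller can abort the request; for all
     * other actions the failure is swallowed and false is returned.
     */
    boolean record(String action, String payload) {
        try {
            sink.record(action, payload);
            return true;
        } catch (Exception e) {
            if (failClosedActions.contains(action)) {
                throw new IllegalStateException("audit write failed for " + action, e);
            }
            // Fail open: a real system would count and alert on this path.
            return false;
        }
    }
}
```

Failing closed only protects the audit trail if the business change can still be rolled back at that point, so the audit write would need to happen inside the same transaction or before the change is committed.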

Asynchronous Write

The API publishes an event and an audit consumer persists it:

@Transactional
public void assignRole(String userId, String role) {
    roleRepository.assign(userId, role);
    outboxRepository.save(AuditEvent.roleAssigned(userId, role));
}

Then a publisher sends the audit event to Kafka:

application -> outbox table -> Kafka -> audit-log-service -> storage

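Because the outbox row is written in the same database transaction as the role change, either both commit or neither does. This avoids losing audit events when the app crashes after the business transaction commits but before anything reaches Kafka. The publisher may deliver an event more than once after a retry, so the audit consumer should deduplicate on eventId.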
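The relay and consumer sides of this pipeline can be sketched in memory to show why duplicates arise and how an idempotent consumer absorbs them. Everything here is an assumption for illustration: OutboxRelay, Row, and IdempotentAuditConsumer are hypothetical names, a List stands in for the outbox table, and a callback stands in for the Kafka producer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical relay: reads unpublished outbox rows and hands them to a sender.
final class OutboxRelay {
    static final class Row {
        final String eventId;
        final String payload;
        boolean published;

        Row(String eventId, String payload) {
            this.eventId = eventId;
            this.payload = payload;
        }
    }

    /** Sends unpublished rows in insertion order and returns how many were sent. */
    static int relay(List<Row> outbox, Consumer<Row> send) {
        int sent = 0;
        for (Row row : outbox) {
            if (row.published) {
                continue;
            }
            send.accept(row);      // if this throws, the row stays unpublished and is retried
            row.published = true;  // a crash before this line causes a duplicate send later
            sent++;
        }
        return sent;
    }
}

// Hypothetical consumer: stores each eventId at most once.
final class IdempotentAuditConsumer {
    private final Set<String> seen = new HashSet<>();
    private final List<String> stored = new ArrayList<>();

    /** Returns true if the event was stored, false if it was a duplicate delivery. */
    boolean handle(String eventId, String payload) {
        if (!seen.add(eventId)) {
            return false;
        }
        stored.add(payload);
        return true;
    }

    int storedCount() {
        return stored.size();
    }
}
```

In a real deployment the deduplication would more likely rely on a unique constraint on the event ID in the audit store than on an in-memory set.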

Storage Model

Audit logs are append-heavy. A relational table works well for moderate volume:

CREATE TABLE audit_events (
  -- IDs such as "evt_01HXYZ" are not valid UUIDs, so store them as text.
  event_id VARCHAR(64) PRIMARY KEY,
  tenant_id VARCHAR(128) NOT NULL,
  actor_type VARCHAR(50) NOT NULL,
  actor_id VARCHAR(128) NOT NULL,
  action VARCHAR(100) NOT NULL,
  resource_type VARCHAR(100) NOT NULL,
  resource_id VARCHAR(128) NOT NULL,
  result VARCHAR(20) NOT NULL,
  -- TIMESTAMPTZ stores an absolute instant, matching the UTC occurredAt values.
  occurred_at TIMESTAMPTZ NOT NULL,
  request_id VARCHAR(128),
  ip_address INET,
  metadata JSONB NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_audit_resource_time
  ON audit_events (tenant_id, resource_type, resource_id, occurred_at DESC);

CREATE INDEX idx_audit_actor_time
  ON audit_events (tenant_id, actor_id, occurred_at DESC);
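Both indexes lead with tenant_id, so every search should filter by tenant first. The sketch below builds such a query with JDBC-style ? placeholders; AuditQuery and its parameter list are hypothetical names chosen for illustration, and the column names come from the table above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder for tenant-scoped audit searches.
final class AuditQuery {
    final String sql;
    final List<Object> params;

    private AuditQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    /** tenantId is mandatory; every other filter is optional and may be null. */
    static AuditQuery build(String tenantId, String actorId, String action,
                            Instant from, Instant to, int limit) {
        if (tenantId == null || tenantId.isEmpty()) {
            throw new IllegalArgumentException("tenantId is required");
        }
        StringBuilder sql = new StringBuilder("SELECT * FROM audit_events WHERE tenant_id = ?");
        List<Object> params = new ArrayList<>();
        params.add(tenantId);
        if (actorId != null) {
            sql.append(" AND actor_id = ?");
            params.add(actorId);
        }
        if (action != null) {
            sql.append(" AND action = ?");
            params.add(action);
        }
        if (from != null) {
            sql.append(" AND occurred_at >= ?");
            params.add(from);
        }
        if (to != null) {
            sql.append(" AND occurred_at < ?");
            params.add(to);
        }
        // Newest first, matching the DESC ordering of the indexes.
        sql.append(" ORDER BY occurred_at DESC LIMIT ?");
        params.add(limit);
        return new AuditQuery(sql.toString(), params);
    }
}
```

Rejecting a missing tenantId in the builder keeps a cross-tenant query from being constructed by accident, which supports the tenant isolation requirement.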

For high-volume systems, use a two-tier model:

PostgreSQL or OpenSearch for recent searchable events
S3/Glacier for long-term retention
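With two tiers, readers and archival jobs need a rule for where an event lives. A minimal sketch, assuming the split is purely age-based and that the hot window is a configuration value; StorageTierRouter and Tier are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical router: decides which tier holds an event of a given age.
final class StorageTierRouter {
    enum Tier { HOT, ARCHIVE }

    private final Duration hotWindow;

    StorageTierRouter(Duration hotWindow) {
        this.hotWindow = hotWindow;
    }

    /** Events older than the hot window belong to the archive tier. */
    Tier tierFor(Instant occurredAt, Instant now) {
        Instant cutoff = now.minus(hotWindow);
        return occurredAt.isBefore(cutoff) ? Tier.ARCHIVE : Tier.HOT;
    }
}
```

For example, with a 90-day window an event from 2025-07-24 is still HOT on 2025-08-01 and belongs to ARCHIVE by 2026-01-01.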

Search Design

Common access patterns:

"Show all changes to user u_123"
"Show all actions by admin a_456 last week"
"Show failed login attempts for tenant t_1"
"Export all permission changes for Q2"

OpenSearch mapping should keep fields structured, with free-form metadata confined to a single flat_object field (the OpenSearch counterpart of Elasticsearch's flattened type):

{
  "mappings": {
    "properties": {
      "tenantId": { "type": "keyword" },
      "actor.id": { "type": "keyword" },
      "action": { "type": "keyword" },
      "resource.type": { "type": "keyword" },
      "resource.id": { "type": "keyword" },
      "occurredAt": { "type": "date" },
      "metadata": { "type": "flat_object" }
    }
  }
}

Do not index every nested metadata field dynamically forever. Mapping explosion is a real production problem.
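One way to keep the indexed field set bounded is to filter metadata against an allow-list before indexing. This is a sketch under that assumption; MetadataFilter and the idea of a per-action allow-list are illustrative, not part of the mapping above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-indexing filter for the metadata object.
final class MetadataFilter {
    /** Returns a copy of metadata containing only the allowed top-level keys. */
    static Map<String, Object> retainAllowed(Map<String, ?> metadata, Set<String> allowedKeys) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : metadata.entrySet()) {
            if (allowedKeys.contains(entry.getKey())) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }
}
```

The full, unfiltered event can still be written to the archive tier, so dropping a key from the search index does not lose the data.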

Immutability and Tamper Resistance

Audit logs should be append-only. Application code should not update or delete individual events.

At the database layer:

REVOKE UPDATE, DELETE, TRUNCATE ON audit_events FROM app_user;
GRANT INSERT, SELECT ON audit_events TO app_user;

For stronger tamper evidence, add hash chaining:

{
  "eventId": "evt_2",
  "payloadHash": "hash(current_payload)",
  "previousHash": "hash(evt_1)"
}

If someone modifies an old event, the chain breaks. This is not a replacement for access control, but it helps detect tampering.
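A minimal sketch of hash chaining with SHA-256, to show how verification pinpoints the first altered entry. It is a simplification of the JSON above: each entry stores a single entryHash computed over the previous hash, the event ID, and the payload. HashChain, Entry, and the "genesis" seed value are hypothetical, and the payload is assumed to be serialized canonically.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory hash chain over audit events.
final class HashChain {
    // Assumed seed used as the "previous hash" of the first entry.
    static final String GENESIS = "genesis";

    static final class Entry {
        final String eventId;
        final String payload;
        final String previousHash;
        final String entryHash;

        Entry(String eventId, String payload, String previousHash, String entryHash) {
            this.eventId = eventId;
            this.payload = payload;
            this.previousHash = previousHash;
            this.entryHash = entryHash;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Appends an event, linking it to the hash of the previous entry. */
    Entry append(String eventId, String payload) {
        String previous = entries.isEmpty() ? GENESIS : entries.get(entries.size() - 1).entryHash;
        Entry entry = new Entry(eventId, payload, previous, digest(previous, eventId, payload));
        entries.add(entry);
        return entry;
    }

    List<Entry> snapshot() {
        return new ArrayList<>(entries);
    }

    /** Returns the index of the first entry that fails verification, or -1 if intact. */
    static int firstBrokenIndex(List<Entry> chain) {
        String expectedPrevious = GENESIS;
        for (int i = 0; i < chain.size(); i++) {
            Entry e = chain.get(i);
            boolean linkOk = e.previousHash.equals(expectedPrevious);
            boolean hashOk = e.entryHash.equals(digest(e.previousHash, e.eventId, e.payload));
            if (!linkOk || !hashOk) {
                return i;
            }
            expectedPrevious = e.entryHash;
        }
        return -1;
    }

    private static String digest(String previousHash, String eventId, String payload) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            String input = previousHash + "\n" + eventId + "\n" + payload;
            StringBuilder hex = new StringBuilder();
            for (byte b : sha.digest(input.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

An attacker who can rewrite every entry can also recompute the whole chain, so the latest hash would need to be recorded somewhere the attacker cannot reach for the check to be meaningful.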

For regulated environments, write old logs to S3 with Object Lock (WORM) retention, so that archived objects cannot be overwritten or deleted until the retention period expires.

Retention and PII

Retention is a policy decision. Do not keep audit logs forever by default.

Example:

Event Type	Retention
Authentication events	1 year
Permission changes	7 years
Billing changes	7 years
Debug-only admin views	90 days
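The table above can be encoded as data so that a cleanup job applies it consistently. A minimal sketch, assuming retention is measured from the event date in whole calendar periods; RetentionPolicy and the Category names are hypothetical.

```java
import java.time.LocalDate;
import java.time.Period;
import java.util.Map;

// Hypothetical encoding of the example retention table.
final class RetentionPolicy {
    enum Category { AUTHENTICATION, PERMISSION_CHANGE, BILLING_CHANGE, DEBUG_ADMIN_VIEW }

    private static final Map<Category, Period> RETENTION = Map.of(
        Category.AUTHENTICATION, Period.ofYears(1),
        Category.PERMISSION_CHANGE, Period.ofYears(7),
        Category.BILLING_CHANGE, Period.ofYears(7),
        Category.DEBUG_ADMIN_VIEW, Period.ofDays(90));

    /** True once today is strictly after the last day the event must be kept. */
    static boolean isExpired(Category category, LocalDate occurredOn, LocalDate today) {
        LocalDate keepUntil = occurredOn.plus(RETENTION.get(category));
        return today.isAfter(keepUntil);
    }
}
```

Keeping the periods in one map makes a policy change a single-line edit that is easy to review.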

PII rules:

store IDs instead of names/emails where possible
hash sensitive values used only for matching
encrypt long-term archives
restrict who can search audit logs
log access to the audit log itself

Production Checklist

Define audit-worthy actions explicitly
Use an append-only event schema
Write through outbox for critical actions
Store structured fields, not free-form strings only
Index by tenant, actor, resource, action, and time
Separate recent search storage from long-term archive
Minimize PII
Make audit log access itself auditable
Add retention policies
Consider hash chaining or WORM storage for tamper evidence

An audit log system is not just a compliance checkbox. It is the memory of your product. Design it as a reliable event system, and it will pay for itself during the first serious investigation.

Technical Trade-offs: Messaging Systems

Pattern	Ordering	Durability	Throughput	Complexity
Log-based (Kafka)	Strict (per partition)	High	Very High	High
Memory-based (Redis Pub/Sub)	None	Low	High	Very Low
Push-based (RabbitMQ)	Per queue (FIFO)	Medium	Medium	Medium

Verbal Interview Script

Interviewer: "How would you ensure high availability and fault tolerance for this specific architecture?"

Candidate: "To achieve 'Five Nines' (99.999%) availability, we must eliminate all Single Points of Failure (SPOF). I would deploy the API Gateway and stateless microservices across multiple Availability Zones (AZs) behind an active-active load balancer. For the data layer, I would use asynchronous replication to a read-replica in a different region for disaster recovery. Furthermore, it's not enough to just deploy redundantly; we must protect the system from cascading failures. I would implement strict timeouts, retry mechanisms with exponential backoff and jitter, and Circuit Breakers (using a library like Resilience4j) on all synchronous network calls between microservices."