Architecture¶
The tool has two audit modes that share a common client layer and model layer but follow separate data flows.
Client layer¶
create_client() is the factory used everywhere. It returns a DatabricksSDKClient when databricks-sdk is installed, or a DatabricksAPIClient (raw HTTP) otherwise.
Both backends implement the AuditClient structural Protocol — swap them freely with --no-sdk or prefer_sdk=False.
DatabricksAPIClient — zero extra dependencies beyond requests:
- OAuth2 client-credentials flow with per-host token caches
- Exponential backoff retry on 429/5xx (configurable)
- Manual SCIM pagination
DatabricksSDKClient — wraps databricks-sdk:
- Automatic auth (picks up
~/.databrickscfg, env vars, or explicit credentials) - SDK-managed pagination and retries
- More reliable in enterprise environments with custom auth providers
Group audit data flow¶
GroupMembershipResolver
└─ walks account SCIM → GroupNode tree
(bulk pre-fetches users + SPs to avoid N+1 calls)
(reads externalId to classify IdP-synced vs Databricks-managed)
WorkspaceDiscovery
└─ lists all workspaces via Account API
(or uses explicit --workspace-urls)
CatalogPermissionScanner
└─ per workspace (parallel, ThreadPoolExecutor):
• list catalogs
• fetch UC grants per catalog
• classify_grant() → DIRECT / UPSTREAM / MEMBER_DIRECT
SchemaPermissionScanner (--scan-schemas)
└─ per accessible (catalog, workspace) pair (parallel):
• list schemas
• fetch UC grants per schema
• classify_grant()
TablePermissionScanner (--scan-tables)
└─ per (catalog, workspace, schema) triple (parallel):
• list tables
• fetch UC grants per table
• classify_grant()
WorkspaceObjectScanner (--scan-workspace-objects)
└─ per workspace (parallel):
• per object type (parallel, 13 types):
– list objects
– fetch /api/2.0/permissions/{type}/{id}
– classify_grant()
RedundancyDetector
└─ compares MEMBER_DIRECT grants against group's effective privileges
(ALL_PRIVILEGES expansion aware)
→ RedundancyLevel: FULL / PARTIAL / NONE
RevokeScriptGenerator (--revoke-script)
└─ REVOKE SQL for each redundant grant
Principal audit data flow¶
PrincipalAuditor.audit()
├─ resolve_principal() via account SCIM
│ → 5-tuple: (id, name, type, external_id, uc_name)
│
├─ BFS upward through all group memberships
│ → List[GroupMembership] (direct + transitive)
│
├─ get_workspace_assignments() — parallel, per workspace
│ → /permissionassignments per workspace
│ → List[WorkspaceRole]
│
├─ _get_workspace_principal_aliases() — Azure AD B2B guest UPN resolution
│ → workspace SCIM lookup by externalId
│ → discovers guest UPN stored under a separate workspace record
│
└─ scan_permissions() — parallel, per unique workspace URL
└─ _scan_one_workspace()
├─ catalog grants (+ schema/table when --scan-schemas/--scan-tables)
└─ workspace object ACLs (--scan-workspace-objects)
→ WorkspaceObjectScanner per workspace
When ws_roles is empty (principal accesses a workspace through an implicit built-in group like "account users"), the workspace object scan falls back to scanning all discovered workspaces.
Workspace roles show explicit assignments only
The /permissionassignments API only returns explicit workspace role grants — ADMIN assignments made directly on the workspace. A user who reaches a workspace solely through the implicit built-in "account users" group will show 0 workspace roles in the output, even though they can log in. This is a Databricks platform constraint, not a tool limitation. Their UC grants are still found via the separate catalog scan path.
Parallelism¶
Both modes use ThreadPoolExecutor for the expensive I/O phases:
- Group audit: workspaces scanned in parallel; schemas and tables in parallel per catalog
- Principal audit:
permissionassignmentsfetched in parallel per workspace; UC scans in parallel per workspace URL
The --workers N flag (default: 8) threads through to both parallel steps.
Credential resolution¶
_resolve_credentials() in cli.py runs before the audit and mutates the parsed args:
- If all three credentials are already set (flags or env vars), skip profile lookup.
- Otherwise load
~/.databrickscfg(orDATABRICKS_CONFIG_FILE) for the named profile. - Fill only the missing values — explicit flags take priority.
- If
--cloudwas not set and the profile has ahost, auto-detect cloud from the host. - Default
--cloudtoazureif still unset.