jspg progress

2026-02-17 17:41:54 -05:00
parent 6e06b6fdc2
commit 32ed463df8
188 changed files with 36654 additions and 15058 deletions

164
GEMINI.md

@@ -1,135 +1,99 @@
# Gemini Project Overview: `jspg`
# JSPG: JSON Schema Postgres
This document outlines the purpose of the `jspg` project, its architecture, and the specific modifications made to the vendored `boon` JSON schema validator crate.
**JSPG** is a high-performance PostgreSQL extension for in-memory JSON Schema validation, specifically targeting **Draft 2020-12**.
## What is `jspg`?
It is designed to serve as the validation engine for the "Punc" architecture, where the database is the single source of truth for all data models and API contracts.
`jspg` is a PostgreSQL extension written in Rust using the `pgrx` framework. Its primary function is to provide fast, in-database JSON schema validation against the 2020-12 draft of the JSON Schema specification.
## 🎯 Goals
### How It Works
1. **Draft 2020-12 Compliance**: Attempt to adhere to the official JSON Schema Draft 2020-12 specification.
2. **Ultra-Fast Validation**: Compile schemas into an optimized in-memory representation for near-instant validation during high-throughput workloads.
3. **Connection-Bound Caching**: Leverage the PostgreSQL session lifecycle to maintain a per-connection schema cache, eliminating the need for repetitive parsing.
4. **Structural Inheritance**: Support object-oriented schema design via Implicit Keyword Shadowing and virtual `.family` schemas.
5. **Punc Integration**: Validation is aware of the "Punc" context (request/response) and can validate `cue` objects efficiently.
The extension is designed for high-performance scenarios where schemas are defined once and used many times for validation. It achieves this through an in-memory cache.
## 🔌 API Reference
1. **Caching and Pre-processing:** A user first calls the `cache_json_schemas(enums, types, puncs)` SQL function. This function takes arrays of JSON objects representing different kinds of schemas:
- `enums`: Standalone enum schemas (e.g., for a `task_priority` list).
- `types`: Schemas for core application data models (e.g., `person`, `organization`). These may contain a `hierarchy` array for inheritance information.
- `puncs`: Schemas for API/function-specific requests and responses.
The extension exposes the following functions to PostgreSQL:
Before compiling, `jspg` performs a crucial **pre-processing step** for type hierarchies. It inspects each definition in the `types` array. If a type includes a `hierarchy` array (e.g., a `person` type with `["entity", "organization", "user", "person"]`), `jspg` uses this to build a map of "type families."
### `cache_json_schemas(enums jsonb, types jsonb, puncs jsonb) -> jsonb`
From this map, it generates new, virtual schemas on the fly. For example, for the `organization` type, it will generate a schema with `$id: "organization.family"` that contains an `enum` of all its descendant types, such as `["organization", "user", "person"]`.
Loads and compiles the entire schema registry into the session's memory.
This allows developers to write more flexible schemas. Instead of strictly requiring a `const` type, you can validate against an entire inheritance chain:
* **Inputs**:
* `enums`: Array of enum definitions.
* `types`: Array of type definitions (core entities).
* `puncs`: Array of punc (function) definitions with request/response schemas.
* **Behavior**:
* Parses all inputs into an internal schema graph.
* Resolves all internal references (`$ref`).
* Generates virtual `.family` schemas for type hierarchies.
* Compiles schemas into validators.
* **Returns**: `{"response": "success"}` or an error object.
```json
// In an "organization" schema definition
"properties": {
"type": {
// Allows the 'type' field to be "organization", "user", or "person"
"$ref": "organization.family",
"override": true
}
}
```
### `validate_json_schema(schema_id text, instance jsonb) -> jsonb`
Finally, all user-defined schemas and the newly generated `.family` schemas are passed to the vendored `boon` crate, compiled into an efficient internal format, and stored in a static, in-memory `SCHEMA_CACHE`. This cache is managed by a `RwLock` to allow for high-performance, concurrent reads during validation.
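As a rough illustration of the static, `RwLock`-guarded cache described above, the following sketch uses only the standard library. `CompiledSchema`, `cache_schemas`, `is_cached`, and `clear_schemas` are hypothetical names chosen for this example; the real extension stores `boon`'s compiled schemas and may organize the cache differently.

```rust
use std::collections::HashMap;
use std::sync::{OnceLock, RwLock};

/// Hypothetical stand-in for a compiled schema (assumption: the real
/// type comes from the vendored validator and holds far more state).
#[derive(Clone, Debug, PartialEq)]
pub struct CompiledSchema {
    pub id: String,
}

/// Process-wide cache keyed by schema `$id`, created lazily on first use.
static SCHEMA_CACHE: OnceLock<RwLock<HashMap<String, CompiledSchema>>> = OnceLock::new();

fn cache() -> &'static RwLock<HashMap<String, CompiledSchema>> {
    SCHEMA_CACHE.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Replace the whole cache while holding a single write lock.
pub fn cache_schemas(schemas: Vec<CompiledSchema>) {
    let mut guard = cache().write().unwrap();
    guard.clear();
    for schema in schemas {
        guard.insert(schema.id.clone(), schema);
    }
}

/// Lookups take only a read lock, so concurrent readers do not block each other.
pub fn is_cached(id: &str) -> bool {
    cache().read().unwrap().contains_key(id)
}

/// Drop every cached schema.
pub fn clear_schemas() {
    cache().write().unwrap().clear();
}
```

The write lock is held only while the cache is rebuilt or cleared; validation-time lookups share the read lock.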
Validates a JSON instance against a pre-compiled schema.
2. **Validation:** The `validate_json_schema(schema_id, instance)` SQL function is then used to validate a JSONB `instance` against a specific, pre-cached schema identified by its `$id`. This function looks up the compiled schema in the cache and runs the validation, returning a success response or a detailed error report.
* **Inputs**:
* `schema_id`: The `$id` of the schema to validate against (e.g., `person`, `save_person.request`).
* `instance`: The JSON data to validate.
* **Returns**:
* On success: `{"response": "success"}`
* On failure: A JSON object containing structured errors (e.g., `{"errors": [...]}`).
3. **Custom Logic:** `jspg` uses a locally modified (vendored) version of the `boon` crate. This allows for powerful, application-specific validation logic that goes beyond the standard JSON Schema specification, such as runtime-based strictness.
### `json_schema_cached(schema_id text) -> bool`
### Error Handling
Checks if a specific schema ID is currently present in the cache.
When validation fails, `jspg` provides a detailed error report in a consistent JSON format, which we refer to as a "DropError". This process involves two main helper functions in `src/lib.rs`:
### `clear_json_schemas() -> jsonb`
1. **`collect_errors`**: `boon` returns a nested tree of `ValidationError` objects. This function recursively traverses that tree to find the most specific, underlying causes of the failure. It filters out structural errors (like `allOf` or `anyOf`) to create a flat list of concrete validation failures.
Clears the current session's schema cache, freeing memory.
2. **`format_errors`**: This function takes the flat list of errors and transforms each one into the final DropError JSON format. It also de-duplicates errors that occur at the same JSON Pointer path, ensuring a cleaner output if a single value violates multiple constraints.
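The two-step flow above (flatten the error tree, then de-duplicate by path) might look roughly like the sketch below. `ErrorNode` is a hypothetical, heavily simplified stand-in for `boon`'s `ValidationError`, and the sketch assumes that any node with nested causes is a structural error (such as `allOf`) that should not be reported itself.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified error node (assumption: the real
/// `ValidationError` carries schema locations and typed error kinds).
#[derive(Debug, Clone)]
pub struct ErrorNode {
    pub keyword: String,        // e.g. "allOf", "minLength"
    pub path: String,           // JSON Pointer into the instance
    pub causes: Vec<ErrorNode>, // nested, more specific errors
}

/// Depth-first walk that keeps only leaf errors as (path, keyword) pairs.
/// Nodes with causes are treated as structural wrappers and skipped.
pub fn collect_errors(node: &ErrorNode, out: &mut Vec<(String, String)>) {
    if node.causes.is_empty() {
        out.push((node.path.clone(), node.keyword.clone()));
    } else {
        for cause in &node.causes {
            collect_errors(cause, out);
        }
    }
}

/// Keep only the first error reported for each JSON Pointer path.
pub fn dedupe_by_path(errors: Vec<(String, String)>) -> Vec<(String, String)> {
    let mut seen = HashSet::new();
    errors
        .into_iter()
        .filter(|(path, _)| seen.insert(path.clone()))
        .collect()
}
```

Keeping the first error per path is a choice made for this sketch; the source only states that errors at the same path are de-duplicated.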
### `show_json_schemas() -> jsonb`
#### DropError Format
Returns a debug dump of the currently cached schemas (for development/debugging).
A DropError object provides a clear, structured explanation of a validation failure:
## ✨ Custom Features & Deviations
```json
{
"code": "ADDITIONAL_PROPERTIES_NOT_ALLOWED",
"message": "Property 'extra' is not allowed",
"details": {
"path": "/extra",
"context": "not allowed",
"cause": {
"got": [
"extra"
]
},
"schema": "basic_strict_test.request"
}
}
```
JSPG implements specific extensions to the Draft 2020-12 standard to support the Punc architecture's object-oriented needs.
- `code` (string): A machine-readable error code (e.g., `ADDITIONAL_PROPERTIES_NOT_ALLOWED`, `MIN_LENGTH_VIOLATED`).
- `message` (string): A human-readable summary of the error.
- `details` (object):
- `path` (string): The JSON Pointer path to the invalid data within the instance.
- `context` (any): The actual value that failed validation.
- `cause` (any): The low-level reason from the validator, often including the expected value (`want`) and the actual value (`got`).
- `schema` (string): The `$id` of the schema that was being validated.
### 1. Implicit Keyword Shadowing
Standard JSON Schema composition (`allOf`) is additive (Intersection), meaning constraints can only be tightened, not replaced. However, JSPG treats `$ref` differently when it appears alongside other properties to support object-oriented inheritance.
---
* **Inheritance (`$ref` + `properties`)**: When a schema uses `$ref` *and* defines its own properties, JSPG implements **Smart Merge** (or Shadowing). If a property is defined in the current schema, its constraints take precedence over the inherited constraints for that specific keyword.
* *Example*: If `Entity` defines `type: { const: "entity" }` and `Person` (which refs Entity) defines `type: { const: "person" }`, validation passes for "person". The local `const` shadows the inherited `const`.
* *Granularity*: Shadowing is per-keyword. If `Entity` defined `type: { const: "entity", minLength: 5 }`, `Person` would shadow `const` but still inherit `minLength: 5`.
## `boon` Crate Modifications
* **Composition (`allOf`)**: When using `allOf`, standard intersection rules apply. No shadowing occurs; all constraints from all branches must pass. This is used for mixins or interfaces.
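To make the per-keyword granularity of shadowing concrete, here is a minimal sketch that models one property's constraints as a keyword-to-value map. The `Keywords` alias and the `shadow` function are illustrative assumptions, not JSPG's actual data structures.

```rust
use std::collections::BTreeMap;

/// Keyword name -> raw constraint value for a single property,
/// e.g. "const" -> "entity" (hypothetical, simplified representation).
pub type Keywords = BTreeMap<String, String>;

/// Merge one property's keywords across a `$ref`: a locally defined keyword
/// replaces the inherited one, and every other inherited keyword is kept.
pub fn shadow(inherited: &Keywords, local: &Keywords) -> Keywords {
    let mut merged = inherited.clone();
    for (keyword, value) in local {
        merged.insert(keyword.clone(), value.clone());
    }
    merged
}
```

With `Entity` contributing `const: "entity"` and `minLength: 5`, and `Person` contributing `const: "person"`, the merged map holds `const: "person"` and `minLength: 5`, matching the example above. Under `allOf`, no such merge would happen and both `const` constraints would apply.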
The version of `boon` located in the `validator/` directory has been significantly modified to support application-specific validation logic that goes beyond the standard JSON Schema specification.
### 2. Virtual Family Schemas (`.family`)
To support polymorphic fields (e.g., a field that accepts any "User" type), JSPG generates virtual schemas representing type hierarchies.
### 1. Property-Level Overrides for Inheritance
* **Mechanism**: When caching types, if a type defines a `hierarchy` (e.g., `["entity", "organization", "person"]`), JSPG generates a schema like `organization.family` which is a `oneOf` containing refs to all valid descendants.
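A sketch of how family membership could be derived from `hierarchy` arrays follows. It assumes each array is ordered from root to leaf and only computes the member list per family; wrapping those members in a generated schema is left out. `build_families` is a hypothetical name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// For every type in every hierarchy chain (ordered root -> leaf), record
/// that type and all types after it as members of `<type>.family`.
pub fn build_families(hierarchies: &[Vec<&str>]) -> BTreeMap<String, BTreeSet<String>> {
    let mut families: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for chain in hierarchies {
        for (i, ancestor) in chain.iter().enumerate() {
            let members = families
                .entry(format!("{}.family", ancestor))
                .or_default();
            // The ancestor itself plus everything below it in this chain.
            for descendant in &chain[i..] {
                members.insert(descendant.to_string());
            }
        }
    }
    families
}
```

Because the sets are merged across chains, a type shared by several hierarchies (such as `entity`) accumulates the descendants from all of them.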
- **Problem:** A primary use case for this project is validating data models that use `$ref` to create inheritance chains (e.g., a `person` schema `$ref`s a `user` schema, which `$ref`s an `entity` schema). A common pattern is to use a `const` keyword on a `type` property to identify the specific model (e.g., `"type": {"const": "person"}`). However, standard JSON Schema composition with `allOf` (which is implicitly used by `$ref`) treats these as a logical AND. This creates an impossible condition where an instance's `type` property would need to be "person" AND "user" AND "entity" simultaneously.
### 3. Strict by Default & Extensibility
JSPG enforces a "Secure by Default" philosophy. All schemas are treated as if `unevaluatedProperties: false` (and `unevaluatedItems: false`) is set, unless explicitly overridden.
- **Solution:** We've implemented a custom, explicit override mechanism. A new keyword, `"override": true`, can be added to any property definition within a schema.
* **Strictness**: By default, any property in the instance data that is not explicitly defined in the schema causes a validation error. This prevents clients from sending undeclared fields.
* **Extensibility (`extensible: true`)**: To allow additional, undefined properties, you must add `"extensible": true` to the schema. This is useful for types that are designed to be open for extension.
* **Ref Boundaries**: Strictness is reset when crossing `$ref` boundaries. The referenced schema's strictness is determined by its own definition (strict by default unless `extensible: true`), ignoring the caller's state.
* **Inheritance**: Strictness is inherited. A schema extending a strict parent will also be strict unless it declares itself `extensible: true`. Conversely, a schema extending a loose parent will also be loose unless it declares itself `extensible: false`.
```json
// person.json
{
"$id": "person",
"$ref": "user",
"properties": {
"type": { "const": "person", "override": true }
}
}
```
This signals to the validator that this definition of the `type` property should be the *only* one applied, and any definitions for `type` found in base schemas (like `user` or `entity`) should be ignored for the duration of this validation.
### 4. Format Leniency for Empty Strings
To simplify frontend form logic, the format validators for `uuid`, `date-time`, and `email` explicitly allow empty strings (`""`). This treats an empty string as "present but unset" rather than "invalid format".
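The leniency rule can be sketched as a thin wrapper around a format check. The UUID check below is a deliberately simplified shape test written for this example (five hex groups of 8-4-4-4-12 characters); it is not the validator's real implementation, and both function names are hypothetical.

```rust
/// Simplified UUID shape check: five hyphen-separated hex groups of
/// lengths 8-4-4-4-12. Illustration only, not a full UUID validator.
fn looks_like_uuid(s: &str) -> bool {
    let parts: Vec<&str> = s.split('-').collect();
    let lens = [8usize, 4, 4, 4, 12];
    parts.len() == 5
        && parts
            .iter()
            .zip(lens)
            .all(|(part, len)| part.len() == len && part.chars().all(|c| c.is_ascii_hexdigit()))
}

/// Lenient format check: an empty string is accepted as "present but unset";
/// anything else must pass the normal format rule.
pub fn check_uuid_format(s: &str) -> bool {
    s.is_empty() || looks_like_uuid(s)
}
```

The same `is_empty() ||` guard would apply to the `date-time` and `email` checks.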
#### Key Changes
## 🏗️ Architecture
This was achieved by making the validator stateful, using a pattern already present in `boon` for handling `unevaluatedProperties`.
The extension is written in Rust using `pgrx` and structures its schema parser to mirror the Punc Generator's design:
1. **Meta-Schema Update**: The meta-schema for Draft 2020-12 was modified to recognize `"override": true` as a valid keyword within a schema object, preventing the compiler from rejecting our custom schemas.
* **Single `Schema` Struct**: A unified struct representing the exact layout of a JSON Schema object, including standard keywords and custom vocabularies (`form`, `display`, etc.).
* **Compiler Phase**: Schema JSON documents are parsed into this struct, linked (references resolved), and then compiled into an efficient validation tree.
* **Validation Phase**: The compiled validators traverse the JSON instance using `serde_json::Value`.
2. **Compiler Modification**: The schema compiler in `validator/src/compiler.rs` was updated. It now inspects sub-schemas within a `properties` keyword and, if it finds `"override": true`, it records the name of that property in a new `override_properties` `HashSet` on the compiled `Schema` struct.
## 🧪 Testing
3. **Stateful Validator with `Override` Context**: The core `Validator` in `validator/src/validator.rs` was modified to carry an `Override` context (a `HashSet` of property names) throughout the validation process.
- **Initialization**: When validation begins, the `Override` context is created and populated with the names of any properties that the top-level schema has marked with `override`.
- **Propagation**: As the validator descends through a `$ref` or `allOf`, this `Override` context is cloned and passed down. The child schema adds its own override properties to the set, ensuring that higher-level overrides are always maintained.
- **Enforcement**: In `obj_validate`, before a property is validated, the validator first checks if the property's name exists in the `Override` context it has received. If it does, it means a parent schema has already claimed responsibility for validating this property, so the child validator **skips** it entirely. This effectively achieves the "top-level wins" inheritance model.
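The initialization, propagation, and enforcement steps above can be sketched with a toy validator. Everything here is a simplification for illustration: `Schema` holds only `const`-style string checks per property, `base` stands in for a `$ref` target, and `validate` is a hypothetical function rather than the real `obj_validate`.

```rust
use std::collections::{BTreeMap, HashSet};

/// Toy schema (assumption: real compiled schemas are far richer).
#[derive(Clone)]
pub struct Schema {
    /// Property name -> required constant value.
    pub properties: BTreeMap<String, String>,
    /// Names of properties marked with `"override": true`.
    pub override_properties: HashSet<String>,
    /// The schema this one `$ref`s, if any.
    pub base: Option<Box<Schema>>,
}

/// Validate `instance`, skipping any property claimed by a higher-level schema.
pub fn validate(
    schema: &Schema,
    instance: &BTreeMap<String, String>,
    inherited: &HashSet<String>,
) -> bool {
    // Enforcement: properties in the received context are skipped entirely.
    for (name, expected) in &schema.properties {
        if inherited.contains(name) {
            continue;
        }
        if let Some(actual) = instance.get(name) {
            if actual != expected {
                return false;
            }
        }
    }
    // Propagation: clone the context and add this schema's own overrides.
    let mut ctx = inherited.clone();
    ctx.extend(schema.override_properties.iter().cloned());
    match &schema.base {
        Some(base) => validate(base, instance, &ctx),
        None => true,
    }
}
```

Calling `validate` on a `person` schema with an empty context lets `person` check `type` itself, then passes `{"type"}` down so the base schema skips its own conflicting `const`.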
Testing is driven by standard Rust unit tests that load JSON fixtures.
This approach cleanly integrates our desired inheritance behavior directly into the validator with minimal and explicit deviation from the standard, avoiding the need for a complex, post-processing validation function like the old `walk_and_validate_refs`.
### 2. Recursive Runtime Strictness Control
- **Problem:** The `jspg` project requires that certain schemas (specifically those for public `puncs` and global `type`s) enforce a strict "no extra properties" policy. This strictness needs to be decided at runtime and must cascade through the entire validation hierarchy, including all nested objects and `$ref` chains. A compile-time flag was unsuitable because it would incorrectly apply strictness to shared, reusable schemas.
- **Solution:** A runtime validation option was implemented to enforce strictness recursively. This required several coordinated changes to the `boon` validator.
#### Key Changes
1. **`ValidationOptions` Struct**: A new `ValidationOptions { be_strict: bool }` struct was added to `validator/src/lib.rs`. The `jspg` code in `src/lib.rs` determines if a validation run should be strict and passes this struct to the validator.
2. **Strictness Check in `uneval_validate`**: The original `boon` only checked for unevaluated properties if the `unevaluatedProperties` keyword was present in the schema. We added an `else if be_strict` block to `uneval_validate` in `validator/src/validator.rs`. This block triggers a check for any leftover unevaluated properties at the end of a validation pass and reports them as errors, effectively enforcing our runtime strictness rule.
3. **Correct Context Propagation**: The most complex part of the fix was ensuring the set of unevaluated properties was correctly maintained across different validation contexts (especially `$ref` and nested property validations). Three critical changes were made:
- **Inheriting Context in `_validate_self`**: When validating keywords that apply to the same instance (like `$ref` or `allOf`), the sub-validator must know what properties the parent has already evaluated. We changed the creation of the `Validator` inside `_validate_self` to pass a clone of the parent's `uneval` state (`uneval: self.uneval.clone()`) instead of creating a new one from scratch. This allows the context to flow downwards.
- **Isolating Context in `validate_val`**: Conversely, when validating a property's value, that value is a *different* part of the JSON instance. The sub-validation should not affect the parent's list of unevaluated properties. We fixed this by commenting out the `self.uneval.merge(...)` call in the `validate_val` function.
- **Simplifying `Uneval::merge`**: The original logic for merging `uneval` state was different for `$ref` keywords. This was incorrect. We simplified the `merge` function to *always* perform an intersection (`retain`), which correctly combines the knowledge of evaluated properties from different schema parts that apply to the same instance.
4. **Removing Incompatible Assertions**: The changes to context propagation broke several `debug_assert!` macros in the `arr_validate` function, which were part of `boon`'s original design. Since our new validation flow is different but correct, these assertions were removed.
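The intersection-based merge and the runtime strictness check can be illustrated with a small sketch. This `Uneval` is a simplified stand-in that tracks only property names as owned strings, and `strict_violations` is a hypothetical helper that mimics the `else if be_strict` branch; neither is the vendored code.

```rust
use std::collections::HashSet;

/// Simplified tracker of properties that no keyword has evaluated yet
/// (assumption: the real struct also tracks array items).
#[derive(Debug, Clone, Default)]
pub struct Uneval {
    pub props: HashSet<String>,
}

impl Uneval {
    /// Always intersect: a property stays unevaluated only if the other
    /// validation pass also left it unevaluated.
    pub fn merge(&mut self, other: &Uneval) {
        self.props.retain(|p| other.props.contains(p));
    }

    /// Under runtime strictness, every leftover property is a violation.
    /// Returns the offending names in sorted order.
    pub fn strict_violations(&self, be_strict: bool) -> Vec<String> {
        if !be_strict {
            return Vec::new();
        }
        let mut leftover: Vec<String> = self.props.iter().cloned().collect();
        leftover.sort();
        leftover
    }
}
```

If one sub-schema evaluates `name` and another evaluates `age`, merging both results into the parent leaves only the truly undeclared properties, which strict mode then reports.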
The fixtures are located in `tests/fixtures/*.json`, and the suite runs via `cargo test`.