jspg progress
164
GEMINI.md
@@ -1,135 +1,99 @@
|
||||
# Gemini Project Overview: `jspg`
|
||||
# JSPG: JSON Schema Postgres
|
||||
|
||||
This document outlines the purpose of the `jspg` project, its architecture, and the specific modifications made to the vendored `boon` JSON schema validator crate.
|
||||
**JSPG** is a high-performance PostgreSQL extension for in-memory JSON Schema validation, specifically targeting **Draft 2020-12**.
|
||||
|
||||
## What is `jspg`?
|
||||
It is designed to serve as the validation engine for the "Punc" architecture, where the database is the single source of truth for all data models and API contracts.
|
||||
|
||||
`jspg` is a PostgreSQL extension written in Rust using the `pgrx` framework. Its primary function is to provide fast, in-database JSON schema validation against the 2020-12 draft of the JSON Schema specification.
|
||||
## 🎯 Goals
|
||||
|
||||
### How It Works
|
||||
1. **Draft 2020-12 Compliance**: Attempt to adhere to the official JSON Schema Draft 2020-12 specification.
|
||||
2. **Ultra-Fast Validation**: Compile schemas into an optimized in-memory representation for near-instant validation during high-throughput workloads.
|
||||
3. **Connection-Bound Caching**: Leverage the PostgreSQL session lifecycle to maintain a per-connection schema cache, eliminating the need for repetitive parsing.
|
||||
4. **Structural Inheritance**: Support object-oriented schema design via Implicit Keyword Shadowing and virtual `.family` schemas.
|
||||
5. **Punc Integration**: Validation is aware of the "Punc" context (request/response) and can validate `cue` objects efficiently.
|
||||
|
||||
The extension is designed for high-performance scenarios where schemas are defined once and used many times for validation. It achieves this through an in-memory cache.
|
||||
## 🔌 API Reference
|
||||
|
||||
1. **Caching and Pre-processing:** A user first calls the `cache_json_schemas(enums, types, puncs)` SQL function. This function takes arrays of JSON objects representing different kinds of schemas:
    - `enums`: Standalone enum schemas (e.g., for a `task_priority` list).
    - `types`: Schemas for core application data models (e.g., `person`, `organization`). These may contain a `hierarchy` array for inheritance information.
    - `puncs`: Schemas for API/function-specific requests and responses.
|
||||
The extension exposes the following functions to PostgreSQL:
|
||||
|
||||
Before compiling, `jspg` performs a crucial **pre-processing step** for type hierarchies. It inspects each definition in the `types` array. If a type includes a `hierarchy` array (e.g., a `person` type with `["entity", "organization", "user", "person"]`), `jspg` uses this to build a map of "type families."
|
||||
### `cache_json_schemas(enums jsonb, types jsonb, puncs jsonb) -> jsonb`
|
||||
|
||||
From this map, it generates new, virtual schemas on the fly. For example, for the `organization` type, it will generate a schema with `$id: "organization.family"` that contains an `enum` of all its descendant types, such as `["organization", "user", "person"]`.
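The family-building step described above can be sketched roughly as follows. This is a minimal illustration in plain Rust, not the extension's actual code: the function name `build_families` and the `(name, hierarchy)` tuple layout are assumptions made for the example, and it only computes the member names that would go into each `.family` enum.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper: maps each "<ancestor>.family" id to the set of type
/// names whose `hierarchy` contains that ancestor.
/// Each entry in `types` pairs a type name with its hierarchy
/// (root first, the type itself last).
fn build_families(types: &[(&str, Vec<&str>)]) -> BTreeMap<String, BTreeSet<String>> {
    let mut families: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for (name, hierarchy) in types {
        // Every ancestor listed in the hierarchy (including the type itself)
        // gains this type as a family member.
        for ancestor in hierarchy {
            families
                .entry(format!("{ancestor}.family"))
                .or_default()
                .insert(name.to_string());
        }
    }
    families
}
```

With the four types `entity`, `organization`, `user`, and `person` registered, the entry for `organization.family` would hold `organization`, `user`, and `person` (stored in sorted order here because of the `BTreeSet`).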
|
||||
Loads and compiles the entire schema registry into the session's memory.
|
||||
|
||||
This allows developers to write more flexible schemas. Instead of strictly requiring a `const` type, you can validate against an entire inheritance chain:
|
||||
* **Inputs**:
    * `enums`: Array of enum definitions.
    * `types`: Array of type definitions (core entities).
    * `puncs`: Array of punc (function) definitions with request/response schemas.
* **Behavior**:
    * Parses all inputs into an internal schema graph.
    * Resolves all internal references (`$ref`).
    * Generates virtual `.family` schemas for type hierarchies.
    * Compiles schemas into validators.
* **Returns**: `{"response": "success"}` or an error object.
|
||||
|
||||
```json
// In an "organization" schema definition
"properties": {
  "type": {
    // Allows the 'type' field to be "organization", "user", or "person"
    "$ref": "organization.family",
    "override": true
  }
}
```
|
||||
### `validate_json_schema(schema_id text, instance jsonb) -> jsonb`
|
||||
|
||||
Finally, all user-defined schemas and the newly generated `.family` schemas are passed to the vendored `boon` crate, compiled into an efficient internal format, and stored in a static, in-memory `SCHEMA_CACHE`. This cache is managed by a `RwLock` to allow for high-performance, concurrent reads during validation.
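As a rough sketch of what a static, `RwLock`-guarded cache of this kind can look like using only the standard library: the `CompiledSchema` stand-in type and the helper names `cache_schema`, `is_cached`, and `clear_cache` are assumptions for illustration, and `OnceLock` is used here only because a `HashMap` cannot be built in a `const` context; the real `SCHEMA_CACHE` may be initialized differently.

```rust
use std::collections::HashMap;
use std::sync::{OnceLock, RwLock};

/// Stand-in for a compiled schema; the real type would come from the validator.
#[derive(Clone, Debug, PartialEq)]
struct CompiledSchema {
    id: String,
}

/// Process-wide cache, lazily initialized on first use.
static SCHEMA_CACHE: OnceLock<RwLock<HashMap<String, CompiledSchema>>> = OnceLock::new();

fn cache() -> &'static RwLock<HashMap<String, CompiledSchema>> {
    SCHEMA_CACHE.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Takes the write lock briefly to store a compiled schema under its `$id`.
fn cache_schema(schema: CompiledSchema) {
    cache().write().unwrap().insert(schema.id.clone(), schema);
}

/// Takes only a read lock, so concurrent lookups do not block each other.
fn is_cached(id: &str) -> bool {
    cache().read().unwrap().contains_key(id)
}

/// Drops every cached schema.
fn clear_cache() {
    cache().write().unwrap().clear();
}
```

The design point is that validation only ever needs the read lock, while the comparatively rare caching and clearing operations take the write lock.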
|
||||
Validates a JSON instance against a pre-compiled schema.
|
||||
|
||||
2. **Validation:** The `validate_json_schema(schema_id, instance)` SQL function is then used to validate a JSONB `instance` against a specific, pre-cached schema identified by its `$id`. This function looks up the compiled schema in the cache and runs the validation, returning a success response or a detailed error report.
|
||||
* **Inputs**:
    * `schema_id`: The `$id` of the schema to validate against (e.g., `person`, `save_person.request`).
    * `instance`: The JSON data to validate.
* **Returns**:
    * On success: `{"response": "success"}`
    * On failure: A JSON object containing structured errors (e.g., `{"errors": [...]}`).
|
||||
|
||||
3. **Custom Logic:** `jspg` uses a locally modified (vendored) version of the `boon` crate. This allows for powerful, application-specific validation logic that goes beyond the standard JSON Schema specification, such as runtime-based strictness.
|
||||
### `json_schema_cached(schema_id text) -> bool`
|
||||
|
||||
### Error Handling
|
||||
Checks if a specific schema ID is currently present in the cache.
|
||||
|
||||
When validation fails, `jspg` provides a detailed error report in a consistent JSON format, which we refer to as a "DropError". This process involves two main helper functions in `src/lib.rs`:
|
||||
### `clear_json_schemas() -> jsonb`
|
||||
|
||||
1. **`collect_errors`**: `boon` returns a nested tree of `ValidationError` objects. This function recursively traverses that tree to find the most specific, underlying causes of the failure. It filters out structural errors (like `allOf` or `anyOf`) to create a flat list of concrete validation failures.
|
||||
Clears the current session's schema cache, freeing memory.
|
||||
|
||||
2. **`format_errors`**: This function takes the flat list of errors and transforms each one into the final DropError JSON format. It also de-duplicates errors that occur at the same JSON Pointer path, ensuring a cleaner output if a single value violates multiple constraints.
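The two-step flow above (flatten the nested tree, then de-duplicate by path) might look something like the following. This is a simplified sketch, not the code in `src/lib.rs`: the `ValidationError` struct here is a toy stand-in for `boon`'s type, the `STRUCTURAL` list is an assumed set of kinds to filter, and the output is reduced to `(path, kind)` pairs rather than full DropError objects.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for the validator's nested error type.
#[derive(Debug, Clone)]
struct ValidationError {
    kind: String,                 // e.g. "allOf", "minLength"
    path: String,                 // JSON Pointer into the instance
    causes: Vec<ValidationError>, // nested, more specific errors
}

/// Assumed set of structural kinds that carry no concrete failure themselves.
const STRUCTURAL: [&str; 3] = ["allOf", "anyOf", "oneOf"];

/// Recursively collects leaf errors, skipping structural ones.
fn collect_errors<'a>(err: &'a ValidationError, out: &mut Vec<&'a ValidationError>) {
    if err.causes.is_empty() {
        if !STRUCTURAL.contains(&err.kind.as_str()) {
            out.push(err);
        }
    } else {
        for cause in &err.causes {
            collect_errors(cause, out);
        }
    }
}

/// Keeps the first error seen for each path and returns (path, kind) pairs
/// ordered by path.
fn format_errors(errors: &[&ValidationError]) -> Vec<(String, String)> {
    let mut by_path: BTreeMap<&str, &str> = BTreeMap::new();
    for e in errors {
        by_path.entry(e.path.as_str()).or_insert(e.kind.as_str());
    }
    by_path
        .into_iter()
        .map(|(path, kind)| (path.to_string(), kind.to_string()))
        .collect()
}
```

Keeping only the first error per path is one possible de-duplication policy; the source does not say which of several errors at the same path is retained.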
|
||||
### `show_json_schemas() -> jsonb`
|
||||
|
||||
#### DropError Format
|
||||
Returns a debug dump of the currently cached schemas (for development/debugging).
|
||||
|
||||
A DropError object provides a clear, structured explanation of a validation failure:
|
||||
## ✨ Custom Features & Deviations
|
||||
|
||||
```json
{
  "code": "ADDITIONAL_PROPERTIES_NOT_ALLOWED",
  "message": "Property 'extra' is not allowed",
  "details": {
    "path": "/extra",
    "context": "not allowed",
    "cause": {
      "got": [
        "extra"
      ]
    },
    "schema": "basic_strict_test.request"
  }
}
```
|
||||
JSPG implements specific extensions to the Draft 2020-12 standard to support the Punc architecture's object-oriented needs.
|
||||
|
||||
- `code` (string): A machine-readable error code (e.g., `ADDITIONAL_PROPERTIES_NOT_ALLOWED`, `MIN_LENGTH_VIOLATED`).
- `message` (string): A human-readable summary of the error.
- `details` (object):
    - `path` (string): The JSON Pointer path to the invalid data within the instance.
    - `context` (any): The actual value that failed validation.
    - `cause` (any): The low-level reason from the validator, often including the expected value (`want`) and the actual value (`got`).
    - `schema` (string): The `$id` of the schema that was being validated.
|
||||
### 1. Implicit Keyword Shadowing
|
||||
Standard JSON Schema composition (`allOf`) is additive (Intersection), meaning constraints can only be tightened, not replaced. However, JSPG treats `$ref` differently when it appears alongside other properties to support object-oriented inheritance.
|
||||
|
||||
---
|
||||
* **Inheritance (`$ref` + `properties`)**: When a schema uses `$ref` *and* defines its own properties, JSPG implements **Smart Merge** (or Shadowing). If a property is defined in the current schema, its constraints take precedence over the inherited constraints for that specific keyword.
    * *Example*: If `Entity` defines `type: { const: "entity" }` and `Person` (which refs Entity) defines `type: { const: "person" }`, validation passes for "person". The local `const` shadows the inherited `const`.
    * *Granularity*: Shadowing is per-keyword. If `Entity` defined `type: { const: "entity", minLength: 5 }`, `Person` would shadow `const` but still inherit `minLength: 5`.
|
||||
|
||||
## `boon` Crate Modifications
|
||||
* **Composition (`allOf`)**: When using `allOf`, standard intersection rules apply. No shadowing occurs; all constraints from all branches must pass. This is used for mixins or interfaces.
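The per-keyword shadowing rule for `$ref` plus local `properties` can be illustrated with a small merge over keyword maps. This is only a sketch of the rule as described, with keyword values simplified to strings; the names `Keywords`, `shadow`, and `keywords` are hypothetical and do not come from the JSPG source.

```rust
use std::collections::BTreeMap;

/// Keyword name -> keyword value for a single property, simplified to strings.
type Keywords = BTreeMap<String, String>;

/// Starts from the inherited keywords, then lets each locally defined keyword
/// replace the inherited keyword of the same name. Keywords that are not
/// redefined locally are still inherited.
fn shadow(inherited: &Keywords, local: &Keywords) -> Keywords {
    let mut merged = inherited.clone();
    for (keyword, value) in local {
        merged.insert(keyword.clone(), value.clone());
    }
    merged
}

/// Convenience constructor for the example.
fn keywords(pairs: &[(&str, &str)]) -> Keywords {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Merging `Entity`'s `type: { const: "entity", minLength: 5 }` with `Person`'s `type: { const: "person" }` this way yields `const: "person"` alongside the inherited `minLength: 5`, matching the granularity example. Under `allOf`, by contrast, no such merge would happen and both `const` constraints would have to hold.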
|
||||
|
||||
The version of `boon` located in the `validator/` directory has been significantly modified to support application-specific validation logic that goes beyond the standard JSON Schema specification.
|
||||
### 2. Virtual Family Schemas (`.family`)
|
||||
To support polymorphic fields (e.g., a field that accepts any "User" type), JSPG generates virtual schemas representing type hierarchies.
|
||||
|
||||
### 1. Property-Level Overrides for Inheritance
|
||||
* **Mechanism**: When caching types, if a type defines a `hierarchy` (e.g., `["entity", "organization", "person"]`), JSPG generates a schema like `organization.family` which is a `oneOf` containing refs to all valid descendants.
|
||||
|
||||
- **Problem:** A primary use case for this project is validating data models that use `$ref` to create inheritance chains (e.g., a `person` schema `$ref`s a `user` schema, which `$ref`s an `entity` schema). A common pattern is to use a `const` keyword on a `type` property to identify the specific model (e.g., `"type": {"const": "person"}`). However, standard JSON Schema composition with `allOf` (which is implicitly used by `$ref`) treats these as a logical AND. This creates an impossible condition where an instance's `type` property would need to be "person" AND "user" AND "entity" simultaneously.
|
||||
### 3. Strict by Default & Extensibility
|
||||
JSPG enforces a "Secure by Default" philosophy. All schemas are treated as if `unevaluatedProperties: false` (and `unevaluatedItems: false`) is set, unless explicitly overridden.
|
||||
|
||||
- **Solution:** We've implemented a custom, explicit override mechanism. A new keyword, `"override": true`, can be added to any property definition within a schema.
|
||||
* **Strictness**: By default, any property in the instance data that is not explicitly defined in the schema causes a validation error. This prevents clients from sending undeclared fields.
* **Extensibility (`extensible: true`)**: To allow additional, undefined properties, you must add `"extensible": true` to the schema. This is useful for types that are designed to be open for extension.
* **Ref Boundaries**: Strictness is reset when crossing `$ref` boundaries. The referenced schema's strictness is determined by its own definition (strict by default unless `extensible: true`), ignoring the caller's state.
* **Inheritance**: Strictness is inherited. A schema extending a strict parent will also be strict unless it declares itself `extensible: true`. Conversely, a schema extending a loose parent will also be loose unless it declares itself `extensible: false`.
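The basic strict-by-default check can be pictured as follows. This sketch covers only the single-schema case (declared properties, instance keys, and an `extensible` flag); it does not model `$ref` boundaries or inherited strictness, and the function name `undeclared_properties` is an assumption for illustration.

```rust
use std::collections::BTreeSet;

/// Returns the instance keys that the schema does not declare.
/// When `extensible` is true, extra keys are allowed and nothing is reported.
fn undeclared_properties(
    declared: &[&str],
    instance_keys: &[&str],
    extensible: bool,
) -> Vec<String> {
    if extensible {
        return Vec::new();
    }
    let declared: BTreeSet<&str> = declared.iter().copied().collect();
    instance_keys
        .iter()
        .filter(|key| !declared.contains(*key))
        .map(|key| key.to_string())
        .collect()
}
```

For a schema declaring only `name`, an instance with keys `name` and `extra` would report `extra` unless the schema is marked extensible.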
|
||||
|
||||
```json
// person.json
{
  "$id": "person",
  "$ref": "user",
  "properties": {
    "type": { "const": "person", "override": true }
  }
}
```
|
||||
|
||||
This signals to the validator that this definition of the `type` property should be the *only* one applied, and any definitions for `type` found in base schemas (like `user` or `entity`) should be ignored for the duration of this validation.
|
||||
### 4. Format Leniency for Empty Strings
|
||||
To simplify frontend form logic, the format validators for `uuid`, `date-time`, and `email` explicitly allow empty strings (`""`). This treats an empty string as "present but unset" rather than "invalid format".
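One way to picture this leniency is a wrapper that short-circuits on the empty string before running the strict format check. The sketch below is illustrative only: `lenient_format` is a hypothetical name, and `looks_like_uuid` is a toy shape check standing in for a real format validator.

```rust
/// Accepts the empty string as "present but unset"; otherwise defers to the
/// strict format check.
fn lenient_format(value: &str, strict_check: fn(&str) -> bool) -> bool {
    value.is_empty() || strict_check(value)
}

/// Toy check for the 8-4-4-4-12 hexadecimal shape of a hyphenated UUID.
/// A real validator would be more thorough.
fn looks_like_uuid(value: &str) -> bool {
    let groups: Vec<&str> = value.split('-').collect();
    let lengths = [8, 4, 4, 4, 12];
    groups.len() == 5
        && groups
            .iter()
            .zip(lengths)
            .all(|(group, len)| group.len() == len && group.chars().all(|c| c.is_ascii_hexdigit()))
}
```

A consequence of this design is that a schema which needs to reject empty strings has to say so separately, for example with `minLength`.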
|
||||
|
||||
#### Key Changes
|
||||
## 🏗️ Architecture
|
||||
|
||||
This was achieved by making the validator stateful, using a pattern already present in `boon` for handling `unevaluatedProperties`.
|
||||
The extension is written in Rust using `pgrx` and structures its schema parser to mirror the Punc Generator's design:
|
||||
|
||||
1. **Meta-Schema Update**: The meta-schema for Draft 2020-12 was modified to recognize `"override": true` as a valid keyword within a schema object, preventing the compiler from rejecting our custom schemas.
|
||||
* **Single `Schema` Struct**: A unified struct representing the exact layout of a JSON Schema object, including standard keywords and custom vocabularies (`form`, `display`, etc.).
* **Compiler Phase**: Schema JSON documents are parsed into this struct, linked (references resolved), and then compiled into an efficient validation tree.
* **Validation Phase**: The compiled validators traverse the JSON instance using `serde_json::Value`.
|
||||
|
||||
2. **Compiler Modification**: The schema compiler in `validator/src/compiler.rs` was updated. It now inspects sub-schemas within a `properties` keyword and, if it finds `"override": true`, it records the name of that property in a new `override_properties` `HashSet` on the compiled `Schema` struct.
|
||||
## 🧪 Testing
|
||||
|
||||
3. **Stateful Validator with `Override` Context**: The core `Validator` in `validator/src/validator.rs` was modified to carry an `Override` context (a `HashSet` of property names) throughout the validation process.
    - **Initialization**: When validation begins, the `Override` context is created and populated with the names of any properties that the top-level schema has marked with `override`.
    - **Propagation**: As the validator descends through a `$ref` or `allOf`, this `Override` context is cloned and passed down. The child schema adds its own override properties to the set, ensuring that higher-level overrides are always maintained.
    - **Enforcement**: In `obj_validate`, before a property is validated, the validator first checks if the property's name exists in the `Override` context it has received. If it does, it means a parent schema has already claimed responsibility for validating this property, so the child validator **skips** it entirely. This effectively achieves the "top-level wins" inheritance model.
|
||||
Testing is driven by standard Rust unit tests that load JSON fixtures.
|
||||
|
||||
This approach cleanly integrates our desired inheritance behavior directly into the validator with minimal and explicit deviation from the standard, avoiding the need for a complex, post-processing validation function like the old `walk_and_validate_refs`.
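The "top-level wins" behavior of the `Override` context can be sketched with a toy validator that walks a `$ref` chain. This is an assumption-laden simplification, not the modified `boon` code: the `Schema` struct below only models `const` constraints on properties, a single `base` reference, and a set of overridden property names, and the top-level schema is taken to receive an empty context.

```rust
use std::collections::{HashMap, HashSet};

/// Toy schema: property name -> required constant, plus override markers
/// and an optional base schema reached via `$ref`.
struct Schema {
    id: &'static str,
    consts: HashMap<&'static str, &'static str>,
    overrides: HashSet<&'static str>,
    base: Option<Box<Schema>>,
}

/// Validates `instance` against `schema` and its bases. Properties named in
/// `inherited` were claimed by a schema higher up the chain and are skipped.
fn validate(
    schema: &Schema,
    instance: &HashMap<&'static str, &'static str>,
    inherited: &HashSet<&'static str>,
) -> Vec<String> {
    let mut errors = Vec::new();
    for (name, expected) in &schema.consts {
        if inherited.contains(name) {
            continue; // a referencing schema overrides this property
        }
        if instance.get(name) != Some(expected) {
            errors.push(format!("{}: '{}' must be '{}'", schema.id, name, expected));
        }
    }
    if let Some(base) = &schema.base {
        // Clone the context and add this schema's own overrides before descending.
        let mut context = inherited.clone();
        context.extend(schema.overrides.iter().copied());
        errors.extend(validate(base, instance, &context));
    }
    errors
}
```

With `person` marking `type` as overridden, the base schema's conflicting `const` is never checked; without the marker, the same instance fails against the base.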
|
||||
|
||||
### 2. Recursive Runtime Strictness Control
|
||||
|
||||
- **Problem:** The `jspg` project requires that certain schemas (specifically those for public `puncs` and global `type`s) enforce a strict "no extra properties" policy. This strictness needs to be decided at runtime and must cascade through the entire validation hierarchy, including all nested objects and `$ref` chains. A compile-time flag was unsuitable because it would incorrectly apply strictness to shared, reusable schemas.
|
||||
|
||||
- **Solution:** A runtime validation option was implemented to enforce strictness recursively. This required several coordinated changes to the `boon` validator.
|
||||
|
||||
#### Key Changes
|
||||
|
||||
1. **`ValidationOptions` Struct**: A new `ValidationOptions { be_strict: bool }` struct was added to `validator/src/lib.rs`. The `jspg` code in `src/lib.rs` determines if a validation run should be strict and passes this struct to the validator.
|
||||
|
||||
2. **Strictness Check in `uneval_validate`**: The original `boon` only checked for unevaluated properties if the `unevaluatedProperties` keyword was present in the schema. We added an `else if be_strict` block to `uneval_validate` in `validator/src/validator.rs`. This block triggers a check for any leftover unevaluated properties at the end of a validation pass and reports them as errors, effectively enforcing our runtime strictness rule.
|
||||
|
||||
3. **Correct Context Propagation**: The most complex part of the fix was ensuring the set of unevaluated properties was correctly maintained across different validation contexts (especially `$ref` and nested property validations). Three critical changes were made:
    - **Inheriting Context in `_validate_self`**: When validating keywords that apply to the same instance (like `$ref` or `allOf`), the sub-validator must know what properties the parent has already evaluated. We changed the creation of the `Validator` inside `_validate_self` to pass a clone of the parent's `uneval` state (`uneval: self.uneval.clone()`) instead of creating a new one from scratch. This allows the context to flow downwards.
    - **Isolating Context in `validate_val`**: Conversely, when validating a property's value, that value is a *different* part of the JSON instance. The sub-validation should not affect the parent's list of unevaluated properties. We fixed this by commenting out the `self.uneval.merge(...)` call in the `validate_val` function.
    - **Simplifying `Uneval::merge`**: The original logic for merging `uneval` state was different for `$ref` keywords. This was incorrect. We simplified the `merge` function to *always* perform an intersection (`retain`), which correctly combines the knowledge of evaluated properties from different schema parts that apply to the same instance.
|
||||
|
||||
4. **Removing Incompatible Assertions**: The changes to context propagation broke several `debug_assert!` macros in the `arr_validate` function, which were part of `boon`'s original design. Since our new validation flow is different but correct, these assertions were removed.
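The intersection-style merge from the context propagation change can be shown in a few lines. The `Uneval` struct below is a reduced stand-in that tracks only property names; the field name `props` and the helper `uneval` are assumptions for the example rather than the validator's real layout.

```rust
use std::collections::HashSet;

/// Reduced stand-in for the validator's unevaluated-state tracker:
/// the set of property names nobody has evaluated yet.
#[derive(Debug, Default, Clone)]
struct Uneval {
    props: HashSet<String>,
}

impl Uneval {
    /// Intersection via `retain`: a property stays unevaluated only if both
    /// sides still consider it unevaluated.
    fn merge(&mut self, other: &Uneval) {
        self.props.retain(|prop| other.props.contains(prop));
    }
}

/// Convenience constructor for the example.
fn uneval(names: &[&str]) -> Uneval {
    Uneval {
        props: names.iter().map(|n| n.to_string()).collect(),
    }
}
```

If the parent has evaluated `age` (leaving `name` and `extra`) and a `$ref` target has evaluated `name` (leaving `age` and `extra`), merging leaves only `extra`, which is the property a strict run would then report.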
|
||||
The tests are located in `tests/fixtures/*.json` and are executed via `cargo test`.
|
||||