Psoxy can be used to sanitize bulk files (eg, CSV, NDJSON, etc), writing the result to another bucket.
You can automate a data pipeline to push files to an `-input` bucket, which will trigger a Psoxy instance (GCP Cloud Function or AWS Lambda) that reads the file, sanitizes it, and writes the result to a corresponding `-sanitized` bucket.
You should limit the size of files processed by the proxy to 200k rows or less, to ensure processing of any single file finishes within the run-time limits of the host platform (AWS, GCP). There is some flexibility here based on the complexity of your rules and file schema, but we've found 200k rows to be a conservative target.
To improve performance and reduce storage costs, you should compress (gzip) the files you write to the `-input` bucket. Psoxy will decompress gzip files before processing and then compress the result before writing to the `-sanitized` bucket. Ensure that you set `Content-Encoding: gzip` on all files in your `-input` bucket to enable this behavior. Note that if you are uploading files via the web UI in GCP/AWS, it is not possible to set this metadata during the initial upload, so you cannot use compression in that scenario.
The 'bulk' mode of Psoxy supports either column-oriented or record-oriented file formats.
To cater to column-oriented file formats (eg, .csv, .tsv), Psoxy supports a `ColumnarRules` format for encoding your sanitization rules. This rules format provides simple, concise configuration for these cases, where complex processing of repeated values or nested field types is not required.
If your use case is record-oriented (eg, NDJSON, etc), with nested or repeated fields, then you will likely need `RecordRules` as an alternative.
The core function of the proxy is to pseudonymize PII in your data. To pseudonymize a column, add it to `columnsToPseudonymize`.
If a column specified in `columnsToPseudonymize` is not present in the input data, the proxy will fail with an error. This prevents a simple column-name typo from resulting in inadvertent data leakage.
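For instance, a minimal `ColumnarRules` sketch (the column names here are illustrative, not required by the proxy):

```yaml
columnsToPseudonymize:
  - "EMPLOYEE_ID"
  - "EMPLOYEE_EMAIL"
```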
To ease integration, the 'bulk' mode also supports a few additional common transformations that may be useful. These provide an alternative to using a separate ETL tool to transform your data, or modifying your existing data export pipelines.
Redaction
To redact a column, add it to `columnsToRedact`. By default, all columns present in the input data will be included in the output data, unless explicitly redacted.
Inclusion
As an alternative to redacting columns, you can specify `columnsToInclude`. If specified, only columns explicitly included will appear in the output data.
Renaming Columns
To rename a column, add it to `columnsToRename`, which is a map from original name --> desired name. Renames are applied before pseudonymization. This feature supports simple adaptation of existing data pipelines for use in Worklytics.
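A sketch combining these options (column names are illustrative; renames are applied before the pseudonymization list is evaluated):

```yaml
columnsToRename:
  workerEmail: "EMPLOYEE_EMAIL"
columnsToPseudonymize:
  - "EMPLOYEE_EMAIL"
columnsToRedact:
  - "MANAGER_NAME"
```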
Rule structure is specified in `ColumnarRules`.
As of Oct 2023, this is a beta feature
`RecordRules` parses files as records, presuming the specified format. It applies the transforms, in order, to each record to sanitize your data, and serializes the result back to the specified format.
For example, a minimal sketch (reconstructed to match the description below; the top-level `format` field and exact values are illustrative):
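```yaml
# reconstructed illustration; see the description below
format: NDJSON
transforms:
  - redact: "$.summary"
  - pseudonymize: "$.email"
```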
Each `transform` is a map from transform type --> a JSONPath to which the transform should be applied. The JSONPath is evaluated from the root of each record in the file.
The above example applies two transforms. First, it redacts `$.summary`, the summary field at the root of the record object. Second, it pseudonymizes `$.email`, the email field at the root of the record object.
`transforms` itself is an ordered list of transforms, which are applied in order.
CSV format is also supported, but each row is in effect converted to a simple JSON object before rules are applied; so JSON paths in transforms should all be single-level, eg, `$.email` to refer to the email column in the CSV.
Rule structure is specified in `RecordRules`.
As of Oct 2023, this feature is in beta and may change in backwards incompatible ways
You can process multiple file formats through a single proxy instance using `MultiTypeBulkDataRules`.
These rules are structured with a field `fileRules`, which is a map from parameterized path template within the "input" bucket to one of the above rule types (`RecordRules`, `ColumnarRules`) to be applied to files matching that path template.
Path templates are evaluated against the incoming file (object) path in order, and the first match is applied to the file. If no templates match the incoming file, it will not be processed.
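A sketch (the path templates and rule bodies are illustrative; the exact template syntax is defined by `MultiTypeBulkDataRules`):

```yaml
fileRules:
  /hris/{week}/export.csv:
    columnsToPseudonymize:
      - "EMPLOYEE_EMAIL"
  /events/{week}/activity.ndjson:
    format: NDJSON
    transforms:
      - redact: "$.summary"
      - pseudonymize: "$.email"
```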
Worklytics' provided Terraform modules include default rules for the expected formats of the `hris`, `survey`, and `badge` data connectors.
If your input data does not match the expected formats, you can customize the rules in one of the following ways.
NOTE: The configuration approaches described below utilize Terraform variables as provided by our GCP and AWS template examples. Other examples may not support these variables; please consult the `variables.tf` at the root of your configuration. If you are directly using Worklytics' Terraform modules, you can consult the `variables.tf` in the module directory to see if these variables are exposed.
You can override the rules used by the predefined bulk connectors (eg `hris`, `survey`, `badge`) by filling the `custom_bulk_connector_rules` variable in your Terraform configuration.
This variable is a map from connector ID --> rules, with the rules encoded in HCL format (rather than YAML as shown above). An illustrative example:
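(Sketch only; the attribute names below mirror the `ColumnarRules` fields described above, but confirm the exact object schema against your configuration's `variables.tf`.)

```hcl
custom_bulk_connector_rules = {
  hris = {
    columnsToRename = {
      "workerEmail" = "EMPLOYEE_EMAIL"
    }
    columnsToPseudonymize = [
      "EMPLOYEE_ID",
      "EMPLOYEE_EMAIL",
    ]
    columnsToRedact = [
      "MANAGER_NAME",
    ]
  }
}
```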
This approach ONLY supports `ColumnarRules`.
Rather than enabling one of the predefined bulk connectors provided in the `worklytics-connector-specs` Terraform module, you can specify a custom connector from scratch, including your own rules.
This approach is less convenient than the previous one, as the TODO documentation and deep-links for connecting your data to Worklytics will not be generated.
To create a custom bulk connector, use the `custom_bulk_connectors` variable in your Terraform configuration, for example:
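(Sketch only; the connector key and field names such as `source_kind` and `rules` are illustrative of the variable's structure; consult `variables.tf` for the exact object type your module version expects.)

```hcl
custom_bulk_connectors = {
  "my-survey-data" = {
    source_kind = "survey"
    rules = {
      columnsToPseudonymize = [
        "EMPLOYEE_EMAIL",
      ]
      columnsToRedact = [
        "FREE_TEXT_COMMENT",
      ]
    }
  }
}
```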
The above example is for `ColumnarRules`.
You can directly modify the `RULES` environment variable on the Psoxy instance, by editing your instance's environment via your hosting provider's console or CLI. In this case, the rules should be encoded in YAML format, such as:
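(A minimal `ColumnarRules` sketch; the column names are illustrative.)

```yaml
columnsToPseudonymize:
  - "EMPLOYEE_ID"
  - "EMPLOYEE_EMAIL"
columnsToRedact:
  - "MANAGER_NAME"
```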
Alternatively, you can remove the environment variable from your instance, and instead configure a `RULES` value in the "namespace" of your instance, in the AWS Parameter Store or GCP Secret Manager (as appropriate for your hosting provider).
This approach is useful for testing, but note that if you later run `terraform apply` again, any changes you make to the environment variable may be overwritten by Terraform.
JSON Filter is inspired by JSON Schema, but with the goal of filtering documents rather than validating them. As such, the basic idea is that data nodes that do not match the filter schema are removed, rather than the whole document failing validation.
The goal of JsonFilter is that only data elements specified in the filter pass through.
Some differences:
- `required` properties are ignored. While in JSON Schema an object that is missing a "required" property is invalid, objects missing "required" properties in a filter will be preserved.
- `{ }`, eg a schema without a type, is interpreted as any valid leaf node (eg, an unconstrained leaf; everything that's not 'array' or 'object'), rather than any valid JSON.
Compatibility goals:
- a valid JSON Schema is convertible to a valid JSON Filter (with JSON Schema features not supported by JSON Filter ignored)
- `{ }` is interpreted as "any valid leaf node", for the following reasons:
  - compactness, esp when encoding a filter as YAML: you can put `{ }` instead of `{ "type": "string" }`
  - flexibility: for the filtering use case, you often just care about which properties are/aren't passed, rather than 'string' vs 'number' vs 'integer'
  - "any valid JSON" is a more common use case in validation than in filtering
Psoxy supports specifying a sanitization rule set to use to sanitize data from an API. This can be configured by encoding the rule set in YAML and setting a parameter in your instance's configuration. See an example of rules for Zoom: zoom.yaml.
If such a parameter is not set, a proxy instance selects default rules based on its source kind, from the corresponding supported source.
You can configure custom rule sets for a given instance via Terraform, by adding an entry to the `custom_api_connector_rules` map in your `terraform.tfvars` file.
For example:
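(Sketch only; whether the value is a path to a YAML rules file, as shown here, or inline YAML depends on your module version; consult `variables.tf`.)

```hcl
custom_api_connector_rules = {
  # hypothetical: supply your own rules for the zoom connector
  zoom = "custom-rules/zoom.yaml"
}
```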
<ruleset> ::= "endpoints:" <endpoint-list>
<endpoint-list> ::= <endpoint> | <endpoint> <endpoint-list>
A ruleset is a list of API endpoints that are permitted to be invoked through the proxy. Requests which do not match an endpoint in this list will be rejected with a `403` response.
<endpoint> ::= <path-template> <allowed-methods> <path-parameter-schemas> <query-parameter-schemas> <response-schema> <transforms>
<path-template> ::= "- pathTemplate: " <string>
Each endpoint is specified by a path template, based on OpenAPI Spec v3.0.0 Path Template syntax. Variable path segments are enclosed in curly braces (`{}`) and are matched by any value that does not contain a `/` character.
See: https://swagger.io/docs/specification/paths-and-operations/
<allowed-methods> ::= "- allowedMethods: " <method-list>
<method-list> ::= <method> | <method> <method-list>
<method> ::= "GET" | "POST" | "PUT" | "PATCH" | "DELETE" | "HEAD"
If provided, only HTTP methods included in this list will be permitted for the endpoint. Given the semantics of RESTful APIs, this provides an additional point at which to enforce "read-only" data access, in addition to OAuth scopes/etc.
NOTE: for AWS-hosted deployments using API Gateway, IAM policies and routes may also be used to restrict HTTP methods. See aws/guides/api-gateway.md for more details.
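Putting the pieces so far together, a sketch of a single endpoint entry (the path and methods are illustrative):

```yaml
endpoints:
  - pathTemplate: "/v2/users/{accountId}/meetings"
    allowedMethods:
      - "GET"
```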
<path-parameter-schemas> ::= "- pathParameterSchemas: " <parameter-schema>
<query-parameter-schemas> ::= "- queryParameterSchemas: " <parameter-schema>
alpha - a parameter schema to use to validate path/query parameter values; if validation fails, the proxy will return a 403 Forbidden response. Given the use case of validating URL / query parameters, only a small subset of JSON Schema is supported.
As of 0.4.38, this is considered an alpha feature which may change in backwards-incompatible ways.
Currently, the supported JSON Schema features are:
- `type`:
  - `string`: value must be a JSON string
  - `integer`: value must be a JSON integer
  - `number`: value must be a JSON number
- `format`:
  - `reversible-pseudonym`: value MUST be a reversible pseudonym generated by the proxy
- `pattern`: a regex pattern to match against the value
- `enum`: a list of values to match against the value

`null`/empty is valid for all types; you can use a `pattern` to restrict this further.
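A sketch of parameter schemas on an endpoint (the path and parameter names are illustrative):

```yaml
endpoints:
  - pathTemplate: "/v2/users/{accountId}/meetings"
    pathParameterSchemas:
      accountId:
        type: "string"
        format: "reversible-pseudonym"
    queryParameterSchemas:
      status:
        type: "string"
        enum:
          - "scheduled"
          - "live"
```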
<response-schema> ::= "responseSchema: " <json-schema-filter>
See: Response Schema Specification below.
<transforms> ::= "transforms:" <transform-list>
<transform-list> ::= <transform> | <transform> <transform-list>
For each Endpoint, rules specify a list of transforms to apply to the response content.
<transform> ::= "- " <transform-type> <json-paths> [<encoding>]
Each transform is specified by a transform type and a list of JSON paths. The transform is applied to all portions of the response content that match any of the JSON paths.
Supported Transform Types:
<transform-type> ::= "!<pseudonymizeEmailHeader>" | "!<pseudonymize>" | "!<redact>" | "!<redactRegexMatches>" | "!<tokenize>" | "!<filterTokenByRegex>" | "!<redactExceptSubstringsMatchingRegexes>"
NOTE: these are implementations of the `com.avaulta.gateway.rules.transforms.Transform` class in the psoxy codebase.
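For instance, a sketch of a transform list within an endpoint (the JSON paths are illustrative):

```yaml
transforms:
  - !<redact>
    jsonPaths:
      - "$..displayName"
  - !<pseudonymize>
    jsonPaths:
      - "$..email"
```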
`!<pseudonymize>` - transforms matching values by normalizing them (trimming whitespace; if they appear to be emails, treating them as case-insensitive; etc) and computing a SHA-256 hash of the normalized value. Relies on the `SALT` value configured in your proxy environment to ensure the SHA-256 is deterministic across time and between sources. In the case of emails, the domain portion is preserved, although the hash is still based on the entire normalized value (this avoids the hash of alice@acme.com matching the hash of alice@beta.com).
Options:
`includeReversible` (default: `false`): If `true`, an encrypted form of the original value will be included in the result. This value, if passed back to the proxy in a URL, will be decrypted back to the original value before the request is forwarded to the data source. This is useful for identifying values that are needed as parameters for subsequent API requests. It relies on symmetric encryption using the `ENCRYPTION_KEY` secret stored in the proxy; if `ENCRYPTION_KEY` is rotated, any 'reversible' values previously generated will no longer be decryptable by the proxy.
`encoding` (default: `JSON`): The encoding to use when serializing the pseudonym to a string.
  - `JSON`: a JSON object structure, with explicit fields
  - `URL_SAFE_TOKEN`: a string format that aims to be concise, URL-safe, and format-preserving for the email case
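For example, a sketch using these options (the JSON path is illustrative):

```yaml
- !<pseudonymize>
  jsonPaths:
    - "$..email"
  includeReversible: true
  encoding: "URL_SAFE_TOKEN"
```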
`!<pseudonymizeEmailHeader>` - transforms matching values by parsing the value as an email header, in accordance with RFC 2822 and some typical conventions, and generating a pseudonym based only on the normalized email address itself (ignoring any name, etc that may appear). In particular, it:
- deals with CSV lists (multiple emails in a single header)
- handles the `name <email>` format, in effect redacting the name and replacing it with a pseudonym based only on the normalized email
`!<redact>` - removes the matching values from the response.
Some extensions of redaction are also supported:
`!<redactExceptSubstringsMatchingRegexes>` - removes the matching values from the response, except for portions of the value that match one of the specified `regex` options. (Use case: preserving portions of event titles that match variants of 'Focus Time', 'No Meetings', etc.)
`!<redactRegexMatches>` - redacts content IF it matches one of the regexes included as an option.
By using a negation in the JSON Path for the transformation, `!<redact>` can be used to implement default-deny style rules, where all fields are redacted except those explicitly listed in the JSON Path expression. This can also redact object-valued fields, conditionally based on object properties, as shown below.
Eg, the following redacts all headers that have a name value other than those explicitly listed below:
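(Sketch; the JSON structure and header names are illustrative, loosely modeled on an email message payload.)

```yaml
- !<redact>
  jsonPaths:
    - "$.payload.headers[?(!(@.name =~ /^(From|To|Subject|Date)$/i))]"
```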
`!<tokenize>` - replaces matching values with a reversible token, which the proxy can reverse to the original value using the `ENCRYPTION_KEY` secret stored in the proxy in subsequent requests. The use case is values that may be sensitive, but are opaque. For example, page tokens in the Microsoft Graph API do not have a defined structure, but in practice contain PII.
Options:
`regex`: a capturing regex used to extract the portion of the value that needs to be tokenized.
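A sketch (the JSON path and regex are illustrative, modeled on a paging URL):

```yaml
- !<tokenize>
  jsonPaths:
    - "$..['@odata.nextLink']"
  regex: "^https://graph.microsoft.com/(.*)$"
```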
`!<filterTokenByRegex>` - splits matching string values into tokens by a delimiter, if provided; then matches each token against a list of `filters`, removing any content that doesn't match at least one of the filters. (Use case: preserving Zoom URLs in meeting descriptions, while removing the rest of the description.)
Options:
`delimiter` - used to split the value into tokens; if not provided, the entire value is treated as a single token.
`filters` - in effect, combined via OR; tokens matching ANY of the filters are preserved in the value.
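A sketch (the JSON path, delimiter, and filter regex are illustrative):

```yaml
- !<filterTokenByRegex>
  jsonPaths:
    - "$.items[*].description"
  delimiter: "\\s+"
  filters:
    - "https://[^.]+\\.zoom\\.us/.*"
```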
A "response schema" is a "JSON Schema Filter" structure, specifying how response (which must be JSON) should be filtered. Using this, you can implement a "default deny" approach to sanitizing API fields in a manner that may be more convenient than using JSON paths with conditional negations (a redact transform with a JSON path that matches all but an explicit list of named fields is the other approach to implementing 'default deny' style rules).
Our "JSON Schema Filter" implementation attempts to align to the JSON Schema specification, with some variation as it is intended for filtering rather than validation. But generally speaking, you should be able to copy the JSON Schema for an API endpoint from its OpenAPI specification as a starting point for the responseSchema
value in your rule set. Similarly, there are tools that can generate JSON Schema from example JSON content, as well as from data models in various languages, that may be useful.
See: https://json-schema.org/implementations.html#schema-generators
If a `responseSchema` attribute is specified for an `endpoint`, the response content will be filtered (rather than validated) against that schema. Eg, fields NOT specified in the schema, or not of the expected type, will be removed from the response.
- `type` - one of:
  - `object`: a JSON object
  - `array`: a JSON array
  - `string`: a JSON string
  - `number`: a JSON number, either integer or decimal
  - `integer`: a JSON integer (not a decimal)
  - `boolean`: a JSON boolean
- `properties` - for `type == object`, a map of field names to schemas to filter each field's value against (eg, another `JsonSchemaFilter` for the field itself)
- `items` - for `type == array`, a schema to filter each item in the array against (again, a `JsonSchemaFilter`)
- `format` - for `type == string`, a format to expect for the string value. As of v0.4.38, this is not enforced by the proxy.
- `$ref` - a reference to a schema specified in the `definitions` property of the root schema
- `definitions` - a map of schema names to schemas of type `JsonSchemaFilter`; only supported at the root schema of an endpoint
Example:
The following is for a User from the GitHub API, via GraphQL. See: https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#the-graphql-endpoint
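(Sketch; the field selection shown is illustrative rather than a complete GitHub schema, and leaf fields use the `{ }` "any valid leaf node" convention described above.)

```yaml
responseSchema:
  type: object
  properties:
    data:
      type: object
      properties:
        user:
          type: object
          properties:
            id: { }
            email: { }
            isSiteAdmin: { }
            organizations:
              type: object
              properties:
                nodes:
                  type: array
                  items:
                    type: object
                    properties:
                      login: { }
```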