Psoxy can be used to sanitize bulk files (eg, CSV, NDJSON, etc), writing the result to another bucket.
You can automate a data pipeline to push files to an `-input` bucket, which will trigger a Psoxy instance (GCP Cloud Function or AWS Lambda) that reads the file, sanitizes it, and writes the result to a corresponding `-sanitized` bucket.
You should limit the size of files processed by the proxy to 200k rows or less, to ensure processing of any single file finishes within the run-time limits of the host platform (AWS, GCP). There is some flexibility here based on the complexity of your rules and file schema, but we've found 200k rows to be a conservative target.
To improve performance and reduce storage costs, you should compress (gzip) the files you write to the `-input` bucket. Psoxy will decompress gzip files before processing and then compress the result before writing to the `-sanitized` bucket. Ensure that you set `Content-Encoding: gzip` on all files in your `-input` bucket to enable this behavior. Note that if you are uploading files via the web UI in GCP/AWS, it is not possible to set this metadata during the initial upload, so you cannot use compression in that scenario.
The 'bulk' mode of Psoxy supports either column-oriented or record-oriented file formats.
To cater to column-oriented file formats (eg, .csv, .tsv), Psoxy supports a `ColumnarRules` format for encoding your sanitization rules. This rules format provides simple, concise configuration for these cases, where complex processing of repeated values or nested field types is not required.
If your use case is record-oriented (eg, NDJSON, etc), with nested or repeated fields, then you will likely need `RecordRules` as an alternative.
The core function of the proxy is to pseudonymize PII in your data. To pseudonymize a column, add it to `columnsToPseudonymize`.
If a column specified in `columnsToPseudonymize` is not present in the input data, the proxy will fail with an error. This prevents a simple column-name typo from resulting in inadvertent data leakage.
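For instance, a minimal `ColumnarRules` sketch (the column names here are illustrative, not required by the proxy):

```yaml
columnsToPseudonymize:
  - "EMPLOYEE_ID"
  - "EMPLOYEE_EMAIL"
```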
To ease integration, the 'bulk' mode also supports a few additional common transformations that may be useful. These provide an alternative to using a separate ETL tool to transform your data, or modifying your existing data export pipelines.
Redaction
To redact a column, add it to `columnsToRedact`. By default, all columns present in the input data will be included in the output data, unless explicitly redacted.
Inclusion
As an alternative to redacting columns, you can specify `columnsToInclude`. If specified, only columns explicitly included will appear in the output data.
Renaming Columns
To rename a column, add it to `columnsToRename`, which is a map from original name --> desired name. Renames are applied before pseudonymization. This feature supports simple adaptation of existing data pipelines for use in Worklytics.
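A sketch combining these options (column names are illustrative; renames are applied before the pseudonymization list is evaluated):

```yaml
columnsToRename:
  workerEmail: "EMPLOYEE_EMAIL"
columnsToPseudonymize:
  - "EMPLOYEE_EMAIL"
columnsToRedact:
  - "MANAGER_NAME"
```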
Rule structure is specified in `ColumnarRules`.
As of Oct 2023, this is a beta feature
`RecordRules` parses files as records, presuming the specified format. It applies the transforms, in order, to each record to sanitize your data, and serializes the result back to the specified format.
For example, a minimal sketch (reconstructed to match the description below; the top-level `format` field and exact values are illustrative):
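```yaml
# reconstructed illustration; see the description below
format: NDJSON
transforms:
  - redact: "$.summary"
  - pseudonymize: "$.email"
```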
Each `transform` is a map from transform type --> a JSONPath to which the transform should be applied. The JSONPath is evaluated from the root of each record in the file.
The above example applies two transforms. First, it redacts `$.summary`, the summary field at the root of the record object. Second, it pseudonymizes `$.email`, the email field at the root of the record object.
`transforms` itself is an ordered list of transforms, which are applied in order.
CSV format is also supported, but each row is in effect converted to a simple JSON object before rules are applied; so JSON paths in transforms should all be single-level, eg, `$.email` to refer to the email column in the CSV.
Rule structure is specified in `RecordRules`.
As of Oct 2023, this feature is in beta and may change in backwards incompatible ways
You can process multiple file formats through a single proxy instance using `MultiTypeBulkDataRules`.
These rules are structured with a field `fileRules`, which is a map from parameterized path template within the "input" bucket to one of the above rule types (`RecordRules`, `ColumnarRules`) to be applied to files matching that path template.
Path templates are evaluated against the incoming file (object) path in order, and the first match is applied to the file. If no templates match the incoming file, it will not be processed.
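A sketch (the path templates and rule bodies are illustrative; the exact template syntax is defined by `MultiTypeBulkDataRules`):

```yaml
fileRules:
  /hris/{week}/export.csv:
    columnsToPseudonymize:
      - "EMPLOYEE_EMAIL"
  /events/{week}/activity.ndjson:
    format: NDJSON
    transforms:
      - redact: "$.summary"
      - pseudonymize: "$.email"
```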
Worklytics' provided Terraform modules include default rules for the expected formats of the `hris`, `survey`, and `badge` data connectors.
If your input data does not match the expected formats, you can customize the rules in one of the following ways.
NOTE: The configuration approaches described below utilize Terraform variables as provided by our GCP and AWS template examples. Other examples may not support these variables; please consult the `variables.tf` at the root of your configuration. If you are directly using Worklytics' Terraform modules, you can consult the `variables.tf` in the module directory to see if these variables are exposed.
You can override the rules used by the predefined bulk connectors (eg `hris`, `survey`, `badge`) by filling the `custom_bulk_connector_rules` variable in your Terraform configuration.
This variable is a map from connector ID --> rules, with the rules encoded in HCL format (rather than YAML as shown above). An illustrative example:
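(Sketch only; the attribute names below mirror the `ColumnarRules` fields described above, but confirm the exact object schema against your configuration's `variables.tf`.)

```hcl
custom_bulk_connector_rules = {
  hris = {
    columnsToRename = {
      "workerEmail" = "EMPLOYEE_EMAIL"
    }
    columnsToPseudonymize = [
      "EMPLOYEE_ID",
      "EMPLOYEE_EMAIL",
    ]
    columnsToRedact = [
      "MANAGER_NAME",
    ]
  }
}
```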
This approach ONLY supports `ColumnarRules`.
Rather than enabling one of the predefined bulk connectors provided in the `worklytics-connector-specs` Terraform module, you can specify a custom connector from scratch, including your own rules.
This approach is less convenient than the previous one, as the TODO documentation and deep-links for connecting your data to Worklytics will not be generated.
To create a custom bulk connector, use the `custom_bulk_connectors` variable in your Terraform configuration, for example:
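(Sketch only; the connector key and field names such as `source_kind` and `rules` are illustrative of the variable's structure; consult `variables.tf` for the exact object type your module version expects.)

```hcl
custom_bulk_connectors = {
  "my-survey-data" = {
    source_kind = "survey"
    rules = {
      columnsToPseudonymize = [
        "EMPLOYEE_EMAIL",
      ]
      columnsToRedact = [
        "FREE_TEXT_COMMENT",
      ]
    }
  }
}
```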
The above example is for `ColumnarRules`.
You can directly modify the `RULES` environment variable on the Psoxy instance, by editing your instance's environment via your hosting provider's console or CLI. In this case, the rules should be encoded in YAML format, such as:
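(A minimal `ColumnarRules` sketch; the column names are illustrative.)

```yaml
columnsToPseudonymize:
  - "EMPLOYEE_ID"
  - "EMPLOYEE_EMAIL"
columnsToRedact:
  - "MANAGER_NAME"
```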
Alternatively, you can remove the environment variable from your instance, and instead configure a `RULES` value in the "namespace" of your instance, in the AWS Parameter Store or GCP Secret Manager (as appropriate for your hosting provider).
This approach is useful for testing, but note that if you later run `terraform apply` again, any changes you make to the environment variable may be overwritten by Terraform.
JSON Filter is inspired by JSON Schema, but with the goal of filtering documents rather than validating them. As such, the basic idea is that data nodes that do not match the filter schema are removed, rather than the whole document failing validation.
The goal of JsonFilter is that only data elements specified in the filter pass through.
Some differences:
- `required` properties are ignored. While in JSON Schema an object that is missing a "required" property is invalid, objects missing "required" properties in a filter will be preserved.
- `{ }`, eg a schema without a type, is interpreted as any valid leaf node (eg, an unconstrained leaf; everything that's not 'array' or 'object'), rather than any valid JSON.
Compatibility goals:
- a valid JSON Schema is convertible to a valid JSON Filter (with JSON Schema features not supported by JSON Filter ignored)
- `{ }` is interpreted as "any valid leaf node", for the following reasons:
  - compactness, esp when encoding a filter as YAML: you can put `{ }` instead of `{ "type": "string" }`
  - flexibility: for the filtering use case, you often just care about which properties are/aren't passed, rather than 'string' vs 'number' vs 'integer'
  - "any valid JSON" is a more common use case in validation than in filtering
Psoxy supports specifying a sanitization rule set to use to sanitize data from an API. This can be configured by encoding the rule set in YAML and setting a parameter in your instance's configuration. See an example of rules for Zoom: zoom.yaml.
If such a parameter is not set, a proxy instance selects default rules based on its source kind, from the corresponding supported source.
You can configure custom rule sets for a given instance via Terraform, by adding an entry to the `custom_api_connector_rules` map in your `terraform.tfvars` file.
For example:
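(Sketch only; whether the value is a path to a YAML rules file, as shown here, or inline YAML depends on your module version; consult `variables.tf`.)

```hcl
custom_api_connector_rules = {
  # hypothetical: supply your own rules for the zoom connector
  zoom = "custom-rules/zoom.yaml"
}
```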
<ruleset> ::= "endpoints:" <endpoint-list>
<endpoint-list> ::= <endpoint> | <endpoint> <endpoint-list>
A ruleset is a list of API endpoints that are permitted to be invoked through the proxy. Requests which do not match an endpoint in this list will be rejected with a `403` response.
<endpoint> ::= <path-template> <allowed-methods> <path-parameter-schemas> <query-parameter-schemas> <response-schema> <transforms>
<path-template> ::= "- pathTemplate: " <string>
Each endpoint is specified by a path template, based on OpenAPI Spec v3.0.0 Path Template syntax. Variable path segments are enclosed in curly braces (`{}`) and are matched by any value that does not contain a `/` character.
See: https://swagger.io/docs/specification/paths-and-operations/
<allowed-methods> ::= "- allowedMethods: " <method-list>
<method-list> ::= <method> | <method> <method-list>
<method> ::= "GET" | "POST" | "PUT" | "PATCH" | "DELETE" | "HEAD"
If provided, only HTTP methods included in this list will be permitted for the endpoint. Given the semantics of RESTful APIs, this provides an additional point at which to enforce "read-only" data access, in addition to OAuth scopes/etc.
NOTE: for AWS-hosted deployments using API Gateway, IAM policies and routes may also be used to restrict HTTP methods. See aws/guides/api-gateway.md for more details.
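Putting the pieces so far together, a sketch of a single endpoint entry (the path and methods are illustrative):

```yaml
endpoints:
  - pathTemplate: "/v2/users/{accountId}/meetings"
    allowedMethods:
      - "GET"
```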
<path-parameter-schemas> ::= "- pathParameterSchemas: " <parameter-schema>
<query-parameter-schemas> ::= "- queryParameterSchemas: " <parameter-schema>
alpha - a parameter schema to use to validate path/query parameter values; if validation fails, the proxy will return a 403 Forbidden response. Given the use case of validating URL / query parameters, only a small subset of JSON Schema is supported.
As of 0.4.38, this is considered an alpha feature which may change in backwards-incompatible ways.
Currently, the supported JSON Schema features are:
- `type`:
  - `string`: value must be a JSON string
  - `integer`: value must be a JSON integer
  - `number`: value must be a JSON number
- `format`:
  - `reversible-pseudonym`: value MUST be a reversible pseudonym generated by the proxy
- `pattern`: a regex pattern to match against the value
- `enum`: a list of values to match against the value

`null`/empty is valid for all types; you can use a `pattern` to restrict this further.
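A sketch of parameter schemas on an endpoint (the path and parameter names are illustrative):

```yaml
endpoints:
  - pathTemplate: "/v2/users/{accountId}/meetings"
    pathParameterSchemas:
      accountId:
        type: "string"
        format: "reversible-pseudonym"
    queryParameterSchemas:
      status:
        type: "string"
        enum:
          - "scheduled"
          - "live"
```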
<response-schema> ::= "responseSchema: " <json-schema-filter>
See: Response Schema Specification below.
<transforms> ::= "transforms:" <transform-list>
<transform-list> ::= <transform> | <transform> <transform-list>
For each Endpoint, rules specify a list of transforms to apply to the response content.
<transform> ::= "- " <transform-type> <json-paths> [<encoding>]
Each transform is specified by a transform type and a list of JSON paths. The transform is applied to all portions of the response content that match any of the JSON paths.
Supported Transform Types:
<transform-type> ::= "!<pseudonymizeEmailHeader>" | "!<pseudonymize>" | "!<redact>" | "!<redactRegexMatches>" | "!<tokenize>" | "!<filterTokenByRegex>" | "!<redactExceptSubstringsMatchingRegexes>"
NOTE: these are implementations of the `com.avaulta.gateway.rules.transforms.Transform` class in the psoxy codebase.
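For instance, a sketch of a transform list within an endpoint (the JSON paths are illustrative):

```yaml
transforms:
  - !<redact>
    jsonPaths:
      - "$..displayName"
  - !<pseudonymize>
    jsonPaths:
      - "$..email"
```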
`!<pseudonymize>` - transforms matching values by normalizing them (trimming whitespace; if they appear to be emails, treating them as case-insensitive; etc) and computing a SHA-256 hash of the normalized value. Relies on the `SALT` value configured in your proxy environment to ensure the SHA-256 is deterministic across time and between sources. In the case of emails, the domain portion is preserved, although the hash is still based on the entire normalized value (this avoids the hash of alice@acme.com matching the hash of alice@beta.com).
Options:
`includeReversible` (default: `false`): If `true`, an encrypted form of the original value will be included in the result. This value, if passed back to the proxy in a URL, will be decrypted back to the original value before the request is forwarded to the data source. This is useful for identifying values that are needed as parameters for subsequent API requests. It relies on symmetric encryption using the `ENCRYPTION_KEY` secret stored in the proxy; if `ENCRYPTION_KEY` is rotated, any 'reversible' values previously generated will no longer be decryptable by the proxy.
`encoding` (default: `JSON`): The encoding to use when serializing the pseudonym to a string.
  - `JSON`: a JSON object structure, with explicit fields
  - `URL_SAFE_TOKEN`: a string format that aims to be concise, URL-safe, and format-preserving for the email case
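For example, a sketch using these options (the JSON path is illustrative):

```yaml
- !<pseudonymize>
  jsonPaths:
    - "$..email"
  includeReversible: true
  encoding: "URL_SAFE_TOKEN"
```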
`!<pseudonymizeEmailHeader>` - transforms matching values by parsing the value as an email header, in accordance with RFC 2822 and some typical conventions, and generating a pseudonym based only on the normalized email address itself (ignoring any name, etc that may appear). In particular, it:
- deals with CSV lists (multiple emails in a single header)
- handles the `name <email>` format, in effect redacting the name and replacing it with a pseudonym based only on the normalized email
`!<redact>` - removes the matching values from the response.
Some extensions of redaction are also supported:
`!<redactExceptSubstringsMatchingRegexes>` - removes the matching values from the response, except for portions of the value that match one of the specified `regex` options. (Use case: preserving portions of event titles that match variants of 'Focus Time', 'No Meetings', etc.)
`!<redactRegexMatches>` - redacts content IF it matches one of the regexes included as an option.
By using a negation in the JSON Path for the transformation, `!<redact>` can be used to implement default-deny style rules, where all fields are redacted except those explicitly listed in the JSON Path expression. This can also redact object-valued fields, conditionally based on object properties, as shown below.
Eg, the following redacts all headers that have a name value other than those explicitly listed below:
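(Sketch; the JSON structure and header names are illustrative, loosely modeled on an email message payload.)

```yaml
- !<redact>
  jsonPaths:
    - "$.payload.headers[?(!(@.name =~ /^(From|To|Subject|Date)$/i))]"
```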
`!<tokenize>` - replaces matching values with a reversible token, which the proxy can reverse to the original value using the `ENCRYPTION_KEY` secret stored in the proxy in subsequent requests. The use case is values that may be sensitive, but are opaque. For example, page tokens in the Microsoft Graph API do not have a defined structure, but in practice contain PII.
Options:
`regex`: a capturing regex used to extract the portion of the value that needs to be tokenized.
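A sketch (the JSON path and regex are illustrative, modeled on a paging URL):

```yaml
- !<tokenize>
  jsonPaths:
    - "$..['@odata.nextLink']"
  regex: "^https://graph.microsoft.com/(.*)$"
```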
`!<filterTokenByRegex>` - splits matching string values into tokens by a delimiter, if provided; then matches each token against a list of `filters`, removing any content that doesn't match at least one of the filters. (Use case: preserving Zoom URLs in meeting descriptions, while removing the rest of the description.)
Options:
`delimiter` - used to split the value into tokens; if not provided, the entire value is treated as a single token.
`filters` - in effect, combined via OR; tokens matching ANY of the filters are preserved in the value.
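A sketch (the JSON path, delimiter, and filter regex are illustrative):

```yaml
- !<filterTokenByRegex>
  jsonPaths:
    - "$.items[*].description"
  delimiter: "\\s+"
  filters:
    - "https://[^.]+\\.zoom\\.us/.*"
```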
A "response schema" is a "JSON Schema Filter" structure, specifying how response (which must be JSON) should be filtered. Using this, you can implement a "default deny" approach to sanitizing API fields in a manner that may be more convenient than using JSON paths with conditional negations (a redact transform with a JSON path that matches all but an explicit list of named fields is the other approach to implementing 'default deny' style rules).
Our "JSON Schema Filter" implementation attempts to align to the JSON Schema specification, with some variation as it is intended for filtering rather than validation. But generally speaking, you should be able to copy the JSON Schema for an API endpoint from its OpenAPI specification as a starting point for the responseSchema
value in your rule set. Similarly, there are tools that can generate JSON Schema from example JSON content, as well as from data models in various languages, that may be useful.
See: https://json-schema.org/implementations.html#schema-generators
If a `responseSchema` attribute is specified for an `endpoint`, the response content will be filtered (rather than validated) against that schema. Eg, fields NOT specified in the schema, or not of the expected type, will be removed from the response.
- `type` - one of:
  - `object`: a JSON object
  - `array`: a JSON array
  - `string`: a JSON string
  - `number`: a JSON number, either integer or decimal
  - `integer`: a JSON integer (not a decimal)
  - `boolean`: a JSON boolean
- `properties` - for `type == object`, a map of field names to schemas to filter each field's value against (eg, another `JsonSchemaFilter` for the field itself)
- `items` - for `type == array`, a schema to filter each item in the array against (again, a `JsonSchemaFilter`)
- `format` - for `type == string`, a format to expect for the string value. As of v0.4.38, this is not enforced by the proxy.
- `$ref` - a reference to a schema specified in the `definitions` property of the root schema
- `definitions` - a map of schema names to schemas of type `JsonSchemaFilter`; only supported at the root schema of an endpoint
Example:
The following is for a User from the GitHub API, via GraphQL. See: https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#the-graphql-endpoint
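(Sketch; the field selection shown is illustrative rather than a complete GitHub schema, and leaf fields use the `{ }` "any valid leaf node" convention described above.)

```yaml
responseSchema:
  type: object
  properties:
    data:
      type: object
      properties:
        user:
          type: object
          properties:
            id: { }
            email: { }
            isSiteAdmin: { }
            organizations:
              type: object
              properties:
                nodes:
                  type: array
                  items:
                    type: object
                    properties:
                      login: { }
```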