1 of 73

0.4.56 PSOXY

A serverless, pseudonymizing, DLP layer between Worklytics and the REST API of your data sources.

Psoxy replaces PII in your organization's data with hash tokens to enable Worklytics's analysis to be performed on anonymized data which we cannot map back to any identifiable individual.

Psoxy is a pseudonymization service that acts as a Security / Compliance layer, which you can deploy between your data sources (SaaS tool APIs, Cloud storage buckets, etc) and the tools that need to access those sources.

Psoxy ensures more secure, granular data access than direct connections between your tools will offer - and enforces access rules to fulfill your Compliance requirements.

Psoxy functions as API-level Data Loss Prevention layer (DLP), by blocking sensitive fields / values / endpoints that would otherwise be exposed when you connect a data sources API to a 3rd party service. It can ensure that data which would otherwise be exposed to a 3rd party service, due to granularity of source API models/permissions, is not accessed or transfered to the service.

Objectives:

serverless - we strive to minimize the moving pieces required to run psoxy at scale, keeping your attack surface small and operational complexity low. Furthermore, we define infrastructure-as-code to ease setup.
transparent - psoxy's source code is available to customers, to facilitate code review and white box penetration testing.
simple - psoxy's functionality will focus on performing secure authentication with the 3rd party API and then perform minimal transformation on the response (pseudonymization, field redaction) to ease code review and auditing of its behavior.

Psoxy may be hosted in or .

Data Flow

A Psoxy instances reside on your premises (in the cloud) and act as an intermediary between Worklytics and the data source you wish to connect. In this role, the proxy performs the authentication necessary to connect to the data source's API and then any required transformation (such as pseudonymization or redaction) on the response.

Orchestration continues to be performed on the Worklytics side.

Source API data may include PII such as:

But Psoxy ensures Worklytics only sees:

These pseudonyms leverage SHA-256 hashing / AES encryption, with salt/keys that are known only to your organization and never transferred to Worklytics.

For data sources APIs which require keys/secrets for authentication, such values remain stored in your premises and are never accessible to Worklytics.

You authorize your Worklytics tenant to access your proxy instance(s) via the IAM platform of your cloud host.

Supported Data Sources

As of March 2023, the following sources can be connected to Worklytics via psoxy.

Note: Some sources require specific licenses to transfer data via the APIs/endpoints used by Worklytics, or impose some per API request costs for such transfers.

Google Workspace (formerly GSuite)

If you use our provided Terraform modules, specific instructions that you can pass to the Google Workspace Admin will be output for you.

NOTE: 'Google Directory' connection is required prerequisite for all other Google Workspace connectors.

NOTE: you may need to enable the various Google Workspace APIs within the GCP project in which you provision the OAuth Clients. If you use our provided terraform modules, this is done automatically.

Microsoft 365

Other Data Sources via REST APIs

These sources will typically require some kind of "Admin" within the tool to create an API key or client, grant the client access to your organization's data, and provide you with the API key/secret which you must provide as a configuration value in your proxy deployment.

The API key/secret will be used to authenticate with the source's REST API and access the data.

Other Data Sources without REST APIs

Other data sources, such as Human Resource Information System (HRIS), Badge, or Survey data can be exported to a CSV file. The "bulk" mode of the proxy can be used to pseudonymize these files by copying/uploading the original to a cloud storage bucket (GCS, S3, etc), which will trigger the proxy to sanitize the file and write the result to a 2nd storage bucket, which you then grant Worklytics access to read.

Getting Started - Customers

Host Platform and Data Sources

The prequisites and dependencies you will need for Psoxy are determined by:

Where you will host psoxy? eg, Amazon Web Services (AWS), or Google Cloud Platform (GCP)
Which data sources you will connect to? eg, Microsoft 365, Google Workspace, Zoom, etc, as defined in previous sections.

Once you've gathered that information, you can identify the required software and permissions in the next section, and the best environment from which to deploy Psoxy.

Prerequisites

At a high-level, you need 3 things:

a cloud host platform account to which you will deploy Psoxy (eg, AWS account or GCP project)
an environment on which you will run the deployment tools (usually your laptop)
some way to authenticate that environment with your host platform as an entity with sufficient permissions to perform the deployment. (usually an AWS IAM Role or a GCP Service Account, which your personal AWS or Google user can assume).

You, or the IAM Role / GCP Service account you use to deploy Psoxy, usually does NOT need to be authorized to access or manage your data sources directly. Data access permissions and steps to grant those vary by data source and generally require action to be taken by the data source administrator AFTER you have deployed Psoxy.

Required Software and Permissions

As of Feb 2023, Psoxy is implemented with Java 11 and built via Maven. The proxy infrastructure is provisioned and the Psoxy code deployed using Terraform, relying on Azure, Google Cloud, and/or AWS command line tools.

You will need all the following in your deployment environment (eg, your laptop):

NOTE: we will support Java versions for duration of official support windows, in particular the LTS versions. As of Nov 2023, we still support java 11 but may end this at any time. Minor versions, such as 12-16, and 18-20, which are out of official support, may work but are not routinely tested.

NOTE: Refrain to use Terraform versions 1.4.x that are < v1.4.3. We've seen bugs.

Depending on your Cloud Host / Data Sources, you will need:

For testing your psoxy instance, you will need:

NOTE: NodeJS 16 is unmaintained since Oct 2023, so we recommend newer version; but in theory should work.

Setup

Choose the cloud platform you'll deploy to, and follow its 'Getting Started' guide:
Based on that choice, pick from the example template repos below. Use your choosen option as a template to create a new GitHub repo, or if you're not using GitHub Cloud, create clone/fork of the choosen option in your source control system:
- AWS - https://github.com/Worklytics/psoxy-example-aws
- GCP - https://github.com/Worklytics/psoxy-example-gcp
You will make changes to the files contained in this repo as appropriate for your use-case. These changes should be committed to a repo that is accessible to other members of your team who may need to support your Psoxy deployment in the future.
Pick the location from which you will deploy (provision) the psoxy instance. This location will need the software prereqs defined in the previous section. Some suggestions:
- your local machine; if you have the prereqs installed and can authenticate it with your host platform (AWS/GCP) as a sufficiently privileged user/role, this is a simple option
Follow the 'Setup' steps in the READMEs of those repos, ultimately running terraform apply to deploy your Psoxy instance(s).
follow any TODO instructions produced by Terraform, such as:
- provision API keys / make OAuth grants needed by each Data Connection
- create the Data Connection from Worklytics to your psoxy instance (Terraform can provide TODO file with detailed steps for each)
Various test commands are provided in local files, as the output of the Terraform; you may use these examples to validate the performance of the proxy. Please review the proxy behavior and adapt the rules as needed. Customers needing assistance adapting the proxy behavior for their needs can contact support@worklytics.co

Component Status

Support

Overview

Psoxy is a serverless, pseudonymizing, Data Loss Prevention (DLP) layer between Worklytics and your data sources. It acts as a Security / Compliance layer, which you can deploy between your data sources (SaaS tool APIs, Cloud storage buckets, etc) and Worklytics.

Benefits include:

Granular authorization on the API endpoint, parameter, and field-levels to your Sources. Eg, limit Worklytics to calling ONLY an explicit subset of an APIs endpoints, with an explicit set of possible parameters, and receiving ONLY a subset fields in response.
no API keys for your data sources are ever sent or held by Worklytics.
any PII present in your data can be pseudonymized before being sent to Worklytics
sensitive data can be redacted before being sent to Worklytics

Modes

Psoxy can be deployed/used in 3 different modes:

API - psoxy sits in front of a data source API. Any call that would normally be sent to the data source API is instead sent to psoxy, which parses the request, validates it / applies ACL, and adds authentication before forwarding to the host API. After the host API response, psoxy sanitizes the response as defined by its roles before returning the response to the caller. This is an http triggered flow.
Bulk File - psoxy is triggered by files (objects) being uploaded to cloud storage buckets (eg, S3, GCS, etc). Psoxy reads the incoming file, applies one or more sanitization rules (transforms), writing the result(s) to a destination (usually in distinct bucket).
Command-line (cli) - psoxy is invoked from the command-line, and is used to sanitize data stored in files on the local machine. This is useful for testing, or for one-off data sanitization tasks. Resulting files can be uploaded to Worklytics via the file upload of its web portal.

Layers of Data Protection

Data transfer via Psoxy provides a layered approach to data protection, with various redundancies against vulnerabilities / misconfigurations to controls implemented at each layer.

Data source API authorization The API of your data source limit the data which your proxy instance can access to a set of oauth scopes. Typically, these align to a set of API endpoints that a given authentication credential is authorized to invoke. In some cases, oauth scopes may limit the fields returned in responses from various endpoints.
Host Platform ACL (IAM) Your proxy instances will be hosted in your preferred cloud hosting provider (eg, AWS, GCP) and access restricted per your host's ACL capabilities. Typically, this means only principals (to borrow AWS's parlance, eg users/roles/etc) which you authorize via an IAM policy can invoke your proxy instances. Apart from limiting who can access data via you proxy instance, IAM rules can enforce read-only access to RESTful APIs by limited the allowed HTTP methods to GET/HEAD/etc.
Proxy-level ACL Psoxy itself offers a sophisticated set of access restriction rules, including limiting access by: - HTTP method (eg, limit to GET/HEAD to ensure read-only access) - API endpoint (eg, limit access to /files/{fileId}/metadata) - API parameter (eg, allow only page,pageSize as parameters)
Proxy-level response transformation Psoxy can be configured to sanitize fields in API responses, including:
- pseudonymizing/tokenizing fields that include PII or sensitive identifiers
- redacting fields containing sensitive information or which aren't needed for analysis

Together, these layers of data protection can redundantly control data access. Eg, you could ensure read-only access to GMail metadata by:

granting the Gmail metadata-only oauth scope to your instance via the Google Workspace Admin console, instead of the full Gmail API scope
restricting only GET requests to your proxy instance via AWS IAM policy
configure rules in your Proxy instance allow only GET requests to be sent to Gmail API via your instances; and only to eht /gmail/v1/users/{mailboxId}/messages and /gmail/v1/users/{mailboxId}/messages/{messageId} endpoints
configure rules in your Proxy instance that filter responses to an explicit set of metadata fields form those endpoints

This example illustrates how the proxy provides data protection across several redundant layers, each provided by different parties. Eg:

you trust Google to correctly implement their oauth scopes and API access controls to limit the access to gmail metadata
you trust AWS to correctly implement their IAM service, enforcing IAM policy to limit data access to the methods and principals you configure.
you trust the Psoxy implementation, which is source-available for your review and testing, to properly implement its specified rules/functionality.
you trust Worklytics to implement its service to not store or process non-metadata fields, even if accessible.

You can verify this trust via the logging provided by your data source (API calls received), your cloud host (eg, AWS cloud watch logs include API calls made via the proxy instance), the psoxy testing tools to simulate API calls and inspect responses, and Worklytics logs.

Authentication

Install Prerequisites

These shell command examples presume Ubuntu; you may need to translate to your *nix variant. If you starting with a fairly rich environment, many of these tools may already be on your machine.

install dependencies

sudo apt update

install Java + maven (required to build the proxy binary to be deployed)

sudo apt install openjdk-11-jdk
sudo apt install maven

# check that maven version at least 3.6+ and java 11+
mvn -v

# if not, get latest direct from Apache Maven
# https://maven.apache.org/install.html

install Terraform

Follow Terraform's install guide (recommended) or, if you need to manage multiple Terraform versions, use tfenv:

git clone https://github.com/tfutils/tfenv.git ~/.tfenv
mkdir ~/bin
sudo apt install unzip
ln -s ~/.tfenv/bin/* ~/bin/
tfenv install
tfenv use latest

if you're deploying in AWS, install the AWS CLI

sudo apt install awscli

if you want to test an AWS deployment, install AWS Curl (which requires python 3.6+ and pip)

# check python version; please ensure it's at least 3.6+
python --version

# if not 3.6+, get latest direct for your environment from Python.org

install pip (likely included with fresh python install), then use that to install awscurl

sudo apt install pip
pip install awscurl

if deploying to GCP or using Google Workspace data sources, install Google Cloud CLI and authenticate.

curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-384.0.0-linux-x86_64.tar.gz
tar -xvf google-cloud-cli-384.0.0-linux-x86_64.tar.gz
sudo ./google-cloud-sdk/install.sh

# if prompted for location of your .bashrc by the gcloud install script, on EC2 it's '/home/ubuntu/.bashrc'

rm google-cloud-cli-384.0.0-linux-x86_64.tar.gz

# source your .bashrc OR restart your terminal so gcloud is found on your $PATH
source ~/.bashrc

# authenticate with Google Cloud CLI
gcloud auth application-default login --no-launch-browser

if using Microsoft 365 data sources, install Azure CLI and authenticate.

https://docs.microsoft.com/en-us/cli/azure/install-azure-cli

You should now be ready for the general instructions in the README.md.

AWS

Getting Started

Overview

You'll provision the following to host Psoxy in AWS:

Lambda Functions
IAM Roles and Policies
System Manager Parameter Store Parameters
Cloud Watch Log Groups
S3 buckets, if using the 'bulk' mode to sanitize file data (such as CSVs)
Cognito Pools and Identities, if connecting to Microsoft 365 (Azure AD) data sources

The diagram below provides an architecture overview of the 'REST' and 'Bulk' use-cases.

Prerequisites

An AWS Account in which to deploy Psoxy We strongly recommend you provision one specifically for use to host Psoxy, as this will create an implicit security boundary, reduce possible conflicts with other infra configured in the account, and simplify eventual cleanup.
You will need the numeric AWS Account ID for this account, which you can find in the AWS Console.
If your AWS organization enforces Service Control Policies, ensure that these are allow the AWS components required by Psoxy or exempt the AWS Account in which you will deploy Psoxy from these policies.
If your organization uses any sort of security control enforcement mechanism, you may have disable/provide exceptions to those controls for you initial deployment. Then generally those controls can be implemented later by extending our examples. Our protips page provides some guidance on how to extend the base examples to meet more extreme requirements.
A sufficiently privileged AWS Role You must have a IAM Role within the AWS account with sufficient privileges to (AWS managed policy examples linked):
- create IAM roles + policies (eg IAMFullAccess)
- create and update Systems Manager Parameters (eg, AmazonSSMFullAccess )
- create and manage Lambdas (eg AWSLambda_FullAccess )
- create and manage S3 buckets (eg AmazonS3FullAccess )
- create Cloud Watch Log groups (eg CloudWatchFullAccess)
(Yes, the use of AWS Managed Policies results in a role with many privileges; that's why we recommend you use a dedicated AWS account to host proxy which is NOT shared with any other use case)
You will need the ARN of this role.
NOTE: if you're connecting to Microsoft 365 (Azure AD) data sources, you'll also need permissions to create AWS Cognito Identity Pools and add Identities to them, such as arn:aws:iam::aws:policy/AmazonCognitoPowerUser. Some AWS Organizations have Service Control Policies in place that deny this by default, even if you have an IAM role that allows it at an account level.
An authenticated AWS CLI in your provisioning environment. Your environment (eg, shell/etc from which you'll run terraform commands) must be authenticated as an identity that can assume that role. (see next section for tips on options for various environments you can use)
Eg, if your Role is arn:aws:iam::123456789012:role/PsoxyProvisioningRole, the following should work:

aws sts assume-role --role-arn arn:aws:iam::123456789012:role/PsoxyProvisioningRole --role-session-name tf_session

If not, use `aws sts get-caller-identity` to confirm how your CLI is authenticated.

Provisioning Environment

To provision AWS infra, you'll need the aws-cli installed and authenticated on the environment where you'll run terraform.

Here are a few options:

Your Local Machine or a VM/Container Outside AWS

Generate an AWS Access Key for your AWS User.
Run aws configure in a terminal on the machine you plan to use, and configure it with the key you generated in step one.

NOTE: this could even be a GCP Cloud Shell, which may simplify auth if your wish to connect your Psoxy instance to Google Workspace as a data source.

EC2 Instance

If your organization prefers NOT to authorize the AWS CLI on individual laptops and/or outside AWS, provisioning Psoxy's required infra from an EC2 instance may be an option.

provision an EC2 instance (or request that your IT/dev ops team provision one for you). We recommend a micro instance with an 8GB disk, running ubuntu (not Amazon Linux; if you choose that or something else, you may need to adapt these instructions). Be sure to create a PEM key to access it via SSH (unless your AWS Organization/account provides some other ssh solution).
associate the Role above with your instance (see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html)
connect to your instance,

# avoid ssh complaints about permissions on your key
chmod 400 psoxy-access-key.pem

ssh -i ~/psoxy-access-key.pem ubuntu@{PUBLIC_IPV4_DNS_OF_YOUR_EC2_INSTANCE}

Whichever environment you choose, follow general prereq installation, and, when ready, continue with README.

Terraform State Backend

S3

You'll also need a backend location for your Terraform state (such as an S3 bucket). It can be in any AWS account, as long as the AWS role that you'll use to run Terraform has read/write access to it.

See https://developer.hashicorp.com/terraform/language/settings/backends/s3 for more details.

Local File System

Alternatively, you may use a local file system, but this is not recommended for production use - as your Terraform state may contain secrets such as API keys, depending on the sources you connect.

See https://developer.hashicorp.com/terraform/language/settings/backends/local

Bootstrap

The module psoxy-constants is a dependency-free module that provides lists of AWS managed policies, etc needed for bootstraping a AWS account in which your proxy instances will reside.

Getting Started

Once you've fulfilled the prereqs, including having your terraform deployment environment, backend, and AWS account prepared, we suggest you use our AWS example template repo:

https://github.com/Worklytics/psoxy-example-aws

Follow the 'Usage' instructions there to continue.

Authentication and Authorization

This page provides an overview of how Psoxy authenticates and confirms authorization of clients (Worklytics tenants).

For general overview of how Psoxy is authorized to access data sources, and authenticates when making API calls to those sources, see API Mode Authentication and Authorization.

Authentication

Each Worklytics tenant operates as a unique GCP service account within Google Cloud. GCP issues an identity token for this service account to processes running in the tenant, which the tenant then uses to authenticate against AWS.

This is OIDC based identity federation (aka "web identity federation" or "workload identity federation").

No secrets or keys need to be exchanged between Worklytics and your AWS instance. The integrity of the authentication is provided by the signature of the identity token provided by GCP, which AWS verifies against Google's public certificates.

AWS provides an overview of the specific GCP Case: Access AWS using a Google Cloud Platform native workload identity

Annotating the diagram for the above case, with specific components for Worklytics-->Proxy case:

In the above, the AWS resource you're allowing access to is AWS IAM role, which your Worklytics tenant assumes and then can access S3 or invoke Lambda function.

Authorization

Within your AWS account, you create an IAM role, with a role assumption policy that allows your Worklytics tenant's GCP Service Account (identified by a numeric ID you obtain from the Worklytics portal) to assume the role.

This assumption policy will have a statement similar to the following, where the value of the aud claim is the numeric ID of your Worklytics tenant's GCP Service Account:

{
	"Effect": "Allow",
	"Principal": {
		"Federated": "accounts.google.com"
	},
	"Action": "sts:AssumeRoleWithWebIdentity",
	"Condition": {
		"StringEquals": {
			"accounts.google.com:aud": "12345678901234567890123456789"
		}
	}
}

Colloquially, this allows a web identity federated from accounts.google.com where Google has asserted the claim that aud == 12345678901234567890123456789 to assume the role.

Then you use this AWS IAM role as the principal in AWS IAM policies you define to authorize to invoke your proxy instances via their function URLs (API connectors) or to read from their sanitized output buckets (bulk data connectors)

See: https://github.com/Worklytics/psoxy/blob/v0.4.40/infra/modules/aws/main.tf#L81-L102

Getting Started with Cloud Shell

YMMV; as of June 2023, AWS's 1GB limit on cloud shell persistent storage is too low for real world proxy deployments, which typically require install gcloud CLI / Azure CLI to connect to sources

So use use your local machine, or a VM/container elsewhere in AWS (EC2, AWS Cloud9, etc

clone the repo

git clone https://github.com/Worklytics/psoxy.git

add the following lines to your ~/.bashrc. (AWS Cloud Shell preserves only your HOME directory across sessions, so add any commands that modify/install things outside to your .bashrc)


# install Maven (and, via dependency, java)
sudo yum -y install maven

# GCloud SDK (if using Google Workspace data sources)
# The next line updates PATH for the Google Cloud SDK.
if [ -f '/home/cloudshell-user/google-cloud-sdk/path.bash.inc' ]; then . '/home/cloudshell-user/google-cloud-sdk/path.bash.inc'; fi

# The next line enables shell command completion for gcloud.
if [ -f '/home/cloudshell-user/google-cloud-sdk/completion.bash.inc' ]; then . '/home/cloudshell-user/google-cloud-sdk/completion.bash.inc'; fi

Then source ~/.bashrc, to execute the above.

install Terraform

git clone https://github.com/tfutils/tfenv.git ~/.tfenv
mkdir ~/bin
ln -s ~/.tfenv/bin/* ~/bin/
tfenv install
tfenv use latest

if using Google Workspace data sources, install Google Cloud CLI and authenticate.

curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-435.0.1-linux-x86_64.tar.gz
tar -xvf google-cloud-cli-435.0.1-linux-x86_64.tar.gz
sudo ./google-cloud-sdk/install.sh
rm google-cloud-cli-435.0.1-linux-x86_64.tar.gz

if using Microsoft 365 data sources, install Azure CLI and authenticate.

https://docs.microsoft.com/en-us/cli/azure/install-azure-cli

You should now be ready for the general instructions in the README.md.

Other stuff

If default NodeJS tooling doesn't work for you, legacy testing tools use python/awscurl, installed via pip. See example below:


# install AWS Curl (used for testing)
sudo yum -y install pip
pip install awscurl

Encryption Keys in AWS

As of June 2023, the following resources provisioned by Psoxy modules support use of CMEKs:

Lambda function environment variables
SSM Parameters
Cloud Watch Log Groups
S3 Buckets

Pre-existing Key

The psoxy-example-aws example provides a project_aws_key_arn variable, that, if provided, will be set as the encryption key for these resources. A few caveats:

The AWS principal your Terraform is running as must have permissions to encrypt/decrypt with the key (it needs to be able to read/write the lambda env, ssm params, etc)
The key should be in the same AWS region you're deploying to.
CloudWatch must be able to use the key, as described in

In example-dev/aws-all/kms-cmek.tf, we provide a bunch of lines that you can uncomment to use encryption on S3 and properly set key policy to support S3/CloudWatch use.

For production use, you should adapt the key policy to your environment and scope as needed to follow your security policies, such as principle of least privilege.

Provisioning a Key

More options

If you need more granular control of CMEK by resource type, review the main.tf and variables exposed by the aws-host module for some options.

Protips

Some ideas on how to support scenarios and configuration requirements beyond what our default examples show:

Encryption Keys

see

Tagging ALL infra created by your Terraform Configuration

If you're using our AWS example, it should support a default_tags variable.

You can add the following in your terrform.tfvars file to set tags on all resources created by the example configuration:

If you're not using our AWS example, you can add the following to your configuration, then you will need to modify the aws provider block in your configuration to add a default_tags. Example shown below:

See: [https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags]

Extensibility

To support extensibility, our Terraform examples/modules output the IDs/names of the major resources they create, so that you can compose them with other Terraform resources.

Buckets

The aws-host module outputs bulk_connector_instances; a map of id => instance for each bulk connector. Each of these has two attributes that correspond to the names of its related buckets:

sanitized_bucket_name
input_bucket_name

So in our AWS example, you can use these to enable logging, for example, you could do something like this: (YMMV, syntax etc should be tested)

Analogous approaches can be used to configure versioning, replication, etc;

Note that encryption, lifecycle, public_access_block are set by the Workltyics-provided modules, so you may have conflicts issues if you also try to set those outside.

Lambda Execution Role

beta - released from v0.4.50; YMMV, and may be subject to change.

The terraform modules we provide provision execution roles for each lambda function, and attach by default attach the appropriate AWS Managed Policy to each.

For organizations that don't allow use of AWS Managed Policies, you can use the aws_lambda_execution_role_policy_arn variable to pass in an alternative which will be used INSTEAD of the AWS Managed Policy.

Guides

- brittle! YMMV.

Using API Gateway

Some organizations require use of API Gateway. This is not the default approach for Psoxy since AWS added support for Lambda Function URLs (March 2022), which are a simpler and more direct way to expose lambdas via HTTPS.

Nonetheless, should you wish to use API Gateway we provide beta support for this. It is needed if you wish to put your Lambda functions on a VPC (See lambdas-on-vpc.md).

In particular:

IAM policy that allows api gateway methods to be invoked by the proxy caller role is defined once, using wildcards, and exposes GET/HEAD/POST methods for all resources. While methods are further constrained by routes and the proxy rules themselves, this could be another enforcement point at the infrastructure level - at expense of N policies + attachments in your terraform plan instead of 1.
proxy instances exposed as lambda function urls have 55s timeout, but API Gateway seems to support 30s as max - so this may cause timeouts in certain APIs

Usage

Prerequisites:

the AWS principal (user or role) to provision API gateways. The AWS managed policy provides this.

Add the following to your terraform.tfvars file:

Then terraform apply should create of API gateway-related resources, including policies/etc; and destroy lambda function urls (if you've previously applied with use_api_gateway=false, which is the default).

API Gateway v1 - not supported, but FWIW

If you wish to use API Gateway V1, you will not be able to use the flag above. Instead, you'll have to do something like the following:

Additionally, you'll need to set a different handler class to be invoked instead of the default (co.workltyics.psoxy.Handler, should be co.worklytics.psoxy.APIGatewayV1Handler). This can be done in Terraform or by modifying configuration via AWS Console.

Using AWS Secrets Manager

By default, Psoxy uses AWS Systems Manager Parameter Store to store secrets; this simplifies configuration and minimizes costs. However, you may want to use AWS Secrets Manager to store secrets due to organization policy.

In such a case, you can add the following to your terraform.tfvars file:

This will alter the behavior of the Terraform modules to store everything considered a secret to be stored/loaded from AWS Secrets Manager instead of AWS Systems Manager Parameter Store. Note that Parameter Store is still used for non-secret configuration information, such as proxy rules, etc.

Changes will also be made to AWS IAM Policies, to allow lambda function execution roles to access Secrets Manager as needed.

If any secrets are managed outside of Terraform (such as API keys for certain connectors), you will need to grant access to relevant secrets in Secrets Manager to the principals that will manage these.

Lookup Tables

If you use Psoxy to send pseudonymized data to Worklytics and later wish to re-identify the data that you export from Worklytics to your premises, you'll need a lookup table in your data warehouse to JOIN with that data.

Our aws-host Terraform module, as used in our , provides a variable lookup_table_builders to control generation of these lookup tables.

Populating this variable will generate another version of your HRIS data (aside from the one exposed to Worklytics) which you can then import back to your data warehouse.

To enable it, add the following to your terraform.tfvars file:

In sanitized_accessor_role_names, add the name of whatever AWS role that the principal running ingestion of your lookup table from S3 to your data warehouse will assume. You can add additional role names as needed. Alternatively, you can use an IAM policy created outside of our Terraform module to grant access to the lookup table CSVs within the S3 bucket.

After you apply this configuration, the lookup table will be generated in an S3 bucket. The S3 bucket will be shown in the Terraform output:

Use the bucket name shown in your output to build import pipeline to your data warehouse.

Every time a new hris snapshot is uploaded to the hris -input bucket, TWO copies of it will be created: a sanitized copy in the bucket accessible Worklytics, and the lookup variant in the lookup bucket referenced above (not accessible to Worklytics).

The lookup table CSV file will have the following columns: EMPLOYEE_ID,EMPLOYEE_ID_ORIG

If you load this into your Data Warehouse, you can JOIN it with the data you export from Worklytics.

Then the following query will give re-identified aggregate data:

The employeeId column in the result set will be the original employee ID from your HRIS system.

Security and Privacy Considerations

If your HRIS employee ID column is considered PII, then the lookup table and any re-identified data exports you use it to produce should be handled as Personal data, according to your policies, as these now reference readily identifiable Natural Persons.

If you wish limit re-identification to a subset of your data, you can use additional columns present in your HRIS csv to do so, for example:

Advanced

Within the lookup_table_builders map, you can specify the following fields:

input_connector_id - usually hris; this corresponds the whatever bulk connector you want to build the lookup table for.
rules - this follows the rules structure for the bulk connector case. The example above is suited for HRIS data following the schema expected by Worklytics. If you modify this, be sure to review our documentation or contact support to ensure you don't break your lookup table.

Lambdas on a VPC

beta - This is now available for customer-use, but may still change in backwards incompatible ways.

Our aws-host module provides a vpc_config variable to specify the VPC configuration for the lambdas that our Terraform modules will create, analogous to the block supported by the AWS lambda terraform resource.

Some caveats:

API connectors on a VPC must be exposed via rather than (our Terraform modules will make this change for you).
VPC must be configured such that your lambda has connectivity to AWS services including S3, SSM, and CloudWatch Logs; this is typically done by adding a for each service.
VPC must allow any API connector to connect to data source APIs via HTTPS (eg 443); usually these APIs are on the public internet, so this means egress to public internet.
VPC must allow your API gateway to connect to your lambdas.

The requirements above MAY require you to modify your VPC configuration, and/or the security groups to support proxy deployment. The example we provide in our should fulfill this if you adapt it; or you can use it as a reference to adapt you existing VPC.

To put the lambdas created by our terraform example under a VPC, please follow one of the approaches documented in the next sections.

Usage - Bring-your-own VPC

If you have an existing VPC, you can use it with the vpc_config variable by hard coding the ids of the pre-existing resources (provisioned outside the scope of your proxy's terraform configuration).

Usage - with `vpc.tf`

Prerequisites:

the AWS principal (user or role) you're using to run Terraform must have permissions to manage VPCs, subnets, and security groups. The AWS managed policy AmazonVPCFullAccess provides this.

NOTE: if you provide vpc_config, the value you pass for use_api_gateway_v2 will be ignored; using a VPC requires API Gateway v2, so will override value of this flag to true.

Add the following to "psoxy" module in your main.tf (or uncomment if already present):

Uncomment the relevant lines in vpc.tf in the same directory, and modify as you wish. This file pulls the default VPC/subnet/security group for your AWS account under terraform.

Alternatively, you modify vpc.tf to use a provision non-default VPC/subnet/security group, and reference those from your main.tf - subject to the caveats above.

See the following terraform resources that you'll likely need:

Troubleshooting

Check your Cloud Watch logs for the lambda. Proxy lambda will time out in INIT phase if SSM Parameter Store or your secret store implementation (AWS Secrets Manager, Vault) is not reachable.

Some potential causes of this:

DNS failure - it's going to look up the SSM service by domain; if the DNS zone for the SSM endpoint you've provisioned is not published on the VPC, this will fail; similarly, if the endpoint wasn't configured on a subnet - then it won't have an IP to be resolved.
if the IP is resolved, you should see failure to connect to it in the logs (timeouts); check that your security groups for lambda/subnet/endpoint allow bidirectional traffic necessary for your lambda to retrieve data from SSM via the REST API.

Switching back from using a VPC

Terraform with aws provider doesn't seem to play nice with lambdas/subnets; the subnet can't be destroyed w/o destroying the lambda, but terraform seems unaware of this and will just wait forever.

So:

destroy all your lambdas (terraform state list | grep aws_lambda_function; then terraform destroy --target= for each, remember '' as needed)
destroy the subnet terraform destroy --target=aws_subnet.main

References

https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html
https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html

Least-Privileged Provisioning Role

beta - we're not committed that maintaining this under versioning policy; minor proxy iterations may require changes to privileges required in the least-privileged role.

This is a guide about how to create a role for provisioning psoxy infrastructure in AWS, following the principle of least-privilege at permission-level, rather than policy-level.

Eg, as of v0.4.55 of the proxy, our docs provide guidance on using an AWS role to provision your psoxy infrastructure using the least-privileged set of AWS managed policies possible. A stronger standard would be to use a custom IAM policy rather than AWS managed policy, with the least-privileged set of permissions required.

Additionally, you can specify resource constraints to improve security within a shared AWS account. (However, we do not recommend or officially support deployment into a shared AWS account. We recommend deploying your proxy instances in isolated AWS account to provide an implicit security boundary by default, as an additional layer of protection beyond those provided by our proxy modules)

We provide an example IAM policy document in our psoxy-constants module that you can use to create a IAM policy in AWS. You can do this outside terraform, finding the JSON from that policy OR via terraform as follows:

AWS Troubleshooting

Tips and tricks for using AWS as to host the proxy.

Who are you?

If above doesn't happen seem to work as expected, some ideas in the next section may help.

Your AWS Organization uses SSO via Okta or some similar provider

Options:

execute terraform via
find credentials output by your SSO helper (eg, aws-okta) then fill the AWS CLI env variables yourself:

if your SSO helper fills default AWS credentials file but simply doesn't set the env vars, you may be able to export the profile to AWS_PROFILE, eg

References: https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7

Your AWS User has MFA

Options:

Logs via Cloud Watch

via Web Console

Log into AWS web console
navigate to the AWS account that hosts your proxy instance (you may need to assume a role in that account)
then the region in that account in which your proxy instance is deployed. (default us-east-1)
then search or navigate to the AWS Lambdas feature, and find the specific one you wish to debug
find the tabs for Monitoring then within that, Logging, then click "go to Cloud Watch"

via CLI

Unless your AWS CLI is auth'd as a user who can review logs, first auth it for such a role.

You can do this with a new profile, or setting env variables as follows:

Then, you can do a series of commands as follows:

Errors in Terraform apply

error creating Lambda Function URL

Something like the following:

Your Terraform state is inconsistent. Run something like the following, adapted for your connector:

NOTE: you likely need to change outlook-mail if your error is with a different data source. The \ chars are needed to escape the double-quotes/brackets in your bash command.

Permissions Errors

error reading SSM Parameters

Something like the following:

Check:

the SSM parameter exists in the AWS account
the SSM parameter can be decrypted by the lambda's execution role (if it's encrypted with a KMS key)

Setting IS_DEVELOPMENT_MODE to "true" in the Lambda's Env Vars via the console can enable some additional logging with detailed SSM error messages that will be helpful; but note that some of these errors will be expected in certain configurations.

Our Terraform examples should provide both of the above for you, but worth double-checking.

If those are present, yet the error persists, it's possible that you have some org-level security constraint/policy preventing SSM parameters from being used / read. For example, you have a "default deny" policy set for SSM GET actions/etc. In such a case, you need to add the execute roles for each lambda as exceptions to such policies (find these under AWS --> IAM --> Roles).

AWS Development

Prereqs

Required:

Optional:

AWS SAM CLI () for local testing, if desired
for direct testing of deployed AWS lambda from a terminal

Build

Maven build produces a zip file.

Build core library
From java/impl/aws/:

Run Locally

Locally, you can test function's behavior from invocation on a JSON payload (but not how the API gateway will map HTTP requests to that JSON payload):

https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html

Deploy to AWS

We recommend deploying your Psoxy code into AWS using the terraform modules found in [infra/modules/](../../infra/modules/] for AWS. These modules both provision the required AWS infrastructure, as well as deploying the built binaries for Psoxy as lambdas in the target account.

You'll ultimately provision infrastructure represented in green in the following diagram:

![AWS data flow](./2022-02 Psoxy Data Flow.png)

GCP

getting-started.md

Getting Started

Overview

You'll provision infrastructure that ultimately looks as follows:

This includes:

Cloud Functions
Service Accounts
Secret Manager Secrets, to hold pseudonymization salt, encryption keys, and data source API keys
Cloud Storage Buckets (GCS), if using psoxy to sanitize bulk file data, such as CSVs

NOTE: if you're connecting to Google Workspace as a data source, you'll also need to provision Service Account Keys and activate Google Workspace APIs.

Prerequisites

a Google Project
- we recommend a dedicated GCP project for your deployment, to provide an implicit security boundary around your infrastructure as well as simplify monitoring/cleanup
a GCP (Google) user or Service Account with permissions to provision Service Accounts, Secrets, Storage Buckets, Cloud Functions, and enable APIs within that project. eg:
additional APIs enabled in the project: (using the Service Usage API above, our Terraform will attempt to enable these, but as there is sometimes a few minutes delay in activation and in some cases they are required to read your existing infra prior to apply, you may experience errors. To pre-empt those, we suggest ensuring the following are enabled:

Terraform State Backend

You'll also need a secure backend location for your Terraform state (such as a GCS or S3 bucket). It need not be in the same host platform/project/account to which you are deploying the proxy, as long as the Google/AWS user you are authenticated as when running Terraform has permissions to access it.

Some options:

GCS : https://developer.hashicorp.com/terraform/language/settings/backends/gcs
S3 : https://developer.hashicorp.com/terraform/language/settings/backends/s3

Alternatively, you may use a local file system, but this is not recommended for production use - as your Terraform state may contain secrets such as API keys, depending on the sources you connect.

See: https://developer.hashicorp.com/terraform/language/settings/backends/local

Bootstrap

Example

The https://github.com/Worklytics/psoxy-example-gcp repo provides an example configuration for hosting proxy instances in GCP. You use that template, following it's Usage docs to get started.

Security Considerations

the 'Service Account' approach described in the prerequisites is preferable to giving a Google user account IAM roles to administer your infrastructure directly. You can pass this Service Account's email address to Terraform by setting the gcp_terraform_sa_account_email. Your machine/environments CLI must be authenticated as GCP entity which can impersonate this Service Account, and likely create tokens as it (Service Account Token Creator role).
using a dedicated GCP project is superior to using a shared project, as it provides an implicit security boundary around your infrastructure as well as simplifying monitoring/cleanup. The IAM roles specified in the prerequisites must be granted at the project level, so any non-Proxy infrastructure within the GCP project that hosts your proxy instances will be accessible to the user / service account who's managing the proxy infrastructure.

Authentication and Authorization

This page provides an overview of how psoxy authenticates and confirms authorization of clients (Worklytics tenants) to access data for GCP-hosted deployments.

For general overview of how Psoxy is authorized to access data sources, and authenticates when making API calls to those sources, see .

Authentication

As Worklytics tenants run inside GCP, they are implicitly authenticated by GCP. No secrets or keys need be exchanged between your Worklytics tenant and your Psoxy instance. GCP can verify the identity of requests from Worklytics to your instance, just as it does between any process and resource within GCP.

Authorization

Invocations of your proxy instances are authorized by the IAM policies you define in GCP. For API connectors, you grant the Cloud Function Invoker role to your Worklytics tenant's GCP service account on the Cloud Function for your instance.

For the bulk data case, you grant the Storage Object Viewer role to your Worklytics tenant's GCP service account on the sanitized output bucket for your connector.

You can obtain the identity of your Worklytics tenant's GCP service account from the Worklytics portal.

Getting Started with Google Cloud Shell

clone the repo (or a of it)

if using Microsoft 365 sources, install and authenticate Azure CLI

https://docs.microsoft.com/en-us/cli/azure/install-azure-cli

if deploying AWS infra, and authenticate AWS CLI

You should now be ready for the general instructions in the .

GCP Development

The apply (java, maven, etc).

With those, you can can run locally via IntelliJ, using run configs (located in .idea/runConfigurations):

package install core builds the core JAR, on which implementations depend
gcp - run gmail builds and runs a local instance for GMail

Or from command line:

By default, that serves the function from http://localhost:8080.

GMail Example

1.) run terraform init and terraform apply from infra/dev-personal to provision environment

Local

2.) run locally via IntelliJ run config

3.) execute the following to verify your proxy is working OK

Health check (verifies that your client can reach and invoke the proxy at all; and that is has sensible config)

Using a message id you grab from that:

Cloud

1.) deploy to GCP using Terraform (see infra/). Follow steps in any TODO files it generates.

2.) Set your env vars: (these should be in a TODO file generated by terraform in prev step

3.) grant yourself access (probably not needed if you have primitive role in project, like Owner or Editor)

alternatively, you can add Terraform resource for this to your Terraform config, and apply it again:

Either way, if this function is for prod use, please remove these grants after you're finished testing.

4.) invocation examples

GCP Troubleshooting

Tips and tricks for using GCP as to host the proxy.

GCloud CLI client blocked by Organization policy

Some orgs have policies that block authentication of the GCloud CLI client, requiring you to contact your IT team and have it added to an approved list. Apart from that, there are several possibilities:

use the GCP Cloud Shell (via GCP web console). gcloud is pre-installed and pre-authorized as your Google user in the Cloud Shell.
use a VM in GCP Compute Engine, with the VM running as a sufficiently privileged service account. In such a scenario, gcloud will be pre-authenticated by GCP on the VM as that service account.
create credentials within the project itself:
- enable IAM API and Cloud Resource Manager API within the project
- create OAuth credentials for a 'desktop application' within the target GCP project
- download the client-secrets.json file to your environment
- run gcloud auth application-default login --client-id-file=/path/to/client-secrets.json

GCP rejects calls because APIs disabled on target project

Terraform relies on GCP's REST APIs for its operations. If these APIs are disabled either the target project OR the project in which the identity (service account, OAuth client) under which you're running terraform resides, you may get an error.

The solution is to enable APIs via the Cloud Console, specifically:

IAM API
Cloud Resource Manager API

GCP Terraform State Inconsistencies

If some resources seem to not be properly provisioned, try terraform taint or terraform state rm, to force re-creation. Use terrafrom state list | grep to search for specific resource ids.

Error 400 : One or more users named in policy do not belong to a permitted Customer

If you receive an error such as:

You may need define an exception for the GCP project in which you're deploying the proxy, or add the domain of your Worklytics Tenant SA to the list of allowed domains.

General Guides

Guides

Cleaning Up

Done with your Psoxy deployment?

Terraform makes it easy to clean up when you're through with Psoxy, of you wish to rebuild everything from scratch.

First, a few caveats:

this will NOT undo any changes outside of Terraform, even those we instructed you to perform via TODO - files that Terraform may have generated.
be careful with anything you created outside of Terraform and later imported into Terraform, such as GCP project / AWS account themselves. If you DON'T want to destroy these, do terraform state rm <resource> (analogue of the import) for each.

Do the following to destroy your Psoxy infra:

open you main.tf of your terraform confriguation; remove ALL blocks that aren't terraform, or provider. You'll be left with ~30 lines that looks like the following.

NOTE: do not edit your terraform.tfvars file or remove any references to your AWS / Azure / GCP accounts; Terraform needs be authenticated and know where to delete stuff from!

run terraform apply. It'll prompt you with a plan that says "0 to create, 0 to modify" and then some huge number of things to destroy. Type 'yes' to apply it.

That's it. It should remove all the Terraform infra you created.

if you want to rebuild from scratch, revert your changes to main.tf (git checkout main.tf) and then terraform apply again.

Deployment Migration

Overview

This document describes how to migrate your deployment from one cloud provider to another, or one project/account to another. It does not cover migrating between proxy versions.

Use cases:

move from a dev account to a prod account (Account / Project Migration)
move from a "shared" account to a "dedicated" account (Account / Project Migration)
move from AWS --> GCP, and vice versa (Provider Migration)

Preparation

Preserving Existing Infrastructure

Some data/infrastructure MUST, or at least SHOULD be preserved during your migration. Below is an enumeration of both cases.

Data, such as configuration values, can generally be copied; you just need to make a new copy of it in the new environment managed by the new Terraform configuration.

Some infrastructure, such as API Clients, will be moved; eg, the same underlying resource will continue to exist, it will just be managed by the new Terraform configuration instead of the old one. This is the more tedious case, as you must both import this infrastructure to your new configuration and then rm (remove) it from your old configuration, rather than having it be destroyed when you teardown the old configuration. You should carefully review every terraform apply, including terraform destroy commands, to ensure that infrastructure you intend to move is not destroyed, or replaced (eg, terraform sees it as tainted, and does a destroy + create within a single apply operation).

What you MUST copy:

SALT value. This is a secret used to generate the pseudonyms. If this is lost/destroyed, you will be unable to link any data pseudonymized with the original salt to data you process in the future.
- NOTE: the underlying resource to preserve is actually a random_password resource, not an SSM parameter / GCP Secret - because those simply are being filled from the terraform random_password resource; if you import parameter/secret, but not the random_password, Terraform will generate a new value and overwrite the parameter/secret.
- as of v0.4.35 examples, the terraform resource ID for this value is expected to be module.psoxy.module.psoxy.random_password.pseudonym_salt; if not, you can search for it with terraform state list | grep random_password
value for PSEUDONYMIZE_APP_IDS. This value, if set to true will have the proxy use a rule set that pseudonymizes identifiers issued by source applications themselves in some cases where these identifiers aren't inherently PII - but the association could be considered discoverable.
value for EMAIL_CANONICALIZATION. prior to v0.4.52, this default was in effect STRICT; so if your original deployment was built on a version prior to this, you should explicitly set this value to STRICT in your new configuration (likely email_canonicalization variable in terraform modules)
any custom sanitization rules that you've set, either in your Terraform configuration or directly as the value of a RULES environment variable, SSM Parameter, or GCP Secret.
historical sanitized files for any bulk connectors, if you wish to continue to have this data analyzed by Worklytics. (eg, everything from all your -sanitized buckets)

NOTE: you do NOT need to copy the ENCRYPTION_KEY value; rotation of this value should be expected by clients.

What you SHOULD move:

API Clients. Whether generated by Terraform or not, the "API Client" for a data source must typically be authorized by a data source administrator to grant it access to the data source. As such, if you destroy the client, or lose its id, you'll need to coordinate with the administrator again to recreate it / obtain the configuration information.
- as of v0.4.35, Google Workspace and Microsoft 365 API clients are managed directly by Terraform, so these are important to preserve.

What you SHOULD copy:

API Client Secrets, if generated outside of Terraform. If you destroy/lose these values, you'll need to contact the data source administrator to obtain new versions.

Prior to beginning your migration, you should make a list of what existing infrastructure and/or configuration values you intend to move/copy.

Migration Plan

The following is a rough guide on the steps you need to take to migrate your deployment.

Phase 1 : Gather information from Existing Environment

Salt value. If using an example forked from our template repos at v0.4.35 or later, you can find the output block in your main.tf for pseudonym_salt, uncomment it, run terraform apply. You'll then be able to obtain the value with: terraform output --raw pseudonym_salt On macOS, you can copy the value to your clipboard with: terraform output --raw pseudonym_salt | pbcopy
Microsoft 365 API client, if any:
- Find the resource ids: terraform state list | grep "\.azuread_application\."
- For each, obtain it's objectId: terraform state show 'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector'
- Prepare import command for each client for your new configuration, eg: terraform import 'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector' '<objectId>'
Google Workspace API clients, if any:
- Find the resource ids: tf state list | grep 'google_service_account\.connector-sa'
- For each, obtain its unique_id: terraform state show 'module.worklytics_connectors_google_workspace.module.google_workspace_connection["gdirectory"].google_service_account.connector-sa'
- Prepare import command for each client for your new configuration, eg: terraform import 'module.worklytics_connectors_google_workspace.module.google_workspace_connection["gdirectory"].google_service_account.connector-sa' '<unique_id>'

Phase 2 : Create New Environment

Create a new Terraform configuration from scratch; run terraform init there (if you begin with one of our examples, our init script does this). Use the terraform.tfvars of your existing configuration as a guide for what variables to set, copying over any needed values.
Run a provisional terraform plan and review.
Run the imports you prepared in Phase 1, if all appear OK, run another terraform plan and review (comparing to the old one).
Optionally, run terraform plan -out=plan.out to create a plan file; if you send this, along with all the *.tf/*.tfvars files to Worklytics, we can review it and confirm that it is correct.
Run terraform apply to create the new infrastructure; re-confirm that the plan is not re-creating any API clients/etc that you intended to preserve
Via AWS / GCP console, or CLIs, move the values of any secrets/parameters that you intend to by directly reading the values from your old account/project, and copying them into the new account/project

Phase 3: Migrate

Look at the TODO 3 files/output variables for all your connectors. Make a mapping between the old values and the new values. Send this to Worklytics. It should include for each the proxy URLs, AWS Role to use, and any other values that are changing.
Wait for confirmation that Worklytics has migrated all your connections to the new values. This may take 1-2 days.

Phase 4: Destroy Old Environment

Remove references to any API Clients you migrated in Phase 1:
- eg, terraform state rm 'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector'
run terraform destroy in the old configuration. Carefully review the plan before confirming.
- if you're using Google Workspace sources, you may see destruction of google_project_service resources; if you allow these to be destroyed, these APIS will be disabled; if you are using the same GCP project in your other configuration, you should run terraform apply there again to re-enable them.
You may also destroy any API clients/etc that are managed outside of Terraform and which you did not migrate to the new environment.
You may clean up any configuration values, such as SSM Parameters / GCP Secrets to customize the proxy rules sets, that you may have created in your old host environment.

Terraform Cloud / Enterprise

If you're using Terraform Cloud or Enterprise, here are a few things to keep in mind.

NOTE: this is tested only for gcp; for aws YMMV, and in particular we expect Microsoft 365 sources will not work properly, given how those are authenticated.

Getting Started

Prereqs:

git/java/maven, as described here https://github.com/Worklytics/psoxy#required-software-and-permissions
for testing, you'll need the CLI of your host environment (eg, AWS CLI, GCloud CLI, Azure CLI) as well as npm/NodeJS installed on your local machine

After authenticating your terraform CLI to Terraform Cloud/enterprise, you'll need to:

Create a Project in Terraform Cloud; and a workspace within the project.
Clone one of our example repos and run the ./init script to initialize your terraform.tfvars for Terraform Cloud. This will also put a bunch of useful tooling on your machine.

3. Commit the bundle that was output by the ./init script to your repo:

Change the terraform backend main.tf to point to your Terraform Cloud rather than be local
- remove backend block from main.tf
- add a cloud block within the terraform block in main.tf (obtain content from your Terraform Cloud)
run terraform init to migrate the initial "local" state to the remote state in Terraform Cloud
You'll have to authenticate your Terraform Cloud with Google / AWS / Azure, depending on the cloud you're deploying to / data sources you're using.

TODOs as Outputs

If you're using Terraform Cloud or Enterprise, our convention of writing "TODOs" to the local file system might not work for you.

To address this, we've updated most of our examples to also output todo values as Terraform outputs, todos_1, todos_2, etc.

To get them nicely on your local machine, something like the following:

Terraform API

get an API token from your Terraform Cloud or Enterprise instance (eg, https://developer.hashicorp.com/terraform/cloud-docs/users-teams-organizations/api-tokens).
set it as an env variable, as well as the host:

run a curl command using those values to get each todos:

Terraform CLI

If you have terraform CLI auth'd against your Terraform Cloud or Enterprise instance, then you might be able to avoid the curl-hackery above, and instead use the following:

(This approach should also work with Terraform CLI running with backend, rather than cloud)

Testing Locally

If you have run our init script locally (as suggested in 'Getting Started') then the test tool should have been installed (likely at .terraform/modules/psoxy/tools/). You will need to update everything in todos_2.md to point to this path for those test commands to work.

If you need to directly install/re-install it, something like the following should work:

Testing

By default, the Terraform examples provided by Worklytics install a NodeJS-based tool for testing your proxy deployments.

Full documentation of the test tool is available . And the code is located in the tools directory of the .

Testing Pre-requisites

Wherever you run this test tool from, your AWS or GCloud CLI must be authenticated as an entity with permissions to invoke the Lambda functions / Cloud functions that you deployed for Psoxy.

If you're testing the bulk cases, the entity must be able to read/write to the cloud storage buckets created for each of those bulk examples.

Testing Locally when Terraform ran remotely (eg, Terraform Cloud, GitHub Actions, etc)

If you're running the Terraform examples in a different location from where you wish to run tests, then you can install the tool alone:

Clone the Psoxy repo to your local machine:

From within that clone, install the test tool:

Get specific test commands for your deployment
- If you set the todos_as_outputs variable to true, your Terraform apply run should contain todo2 output variable with testing instructions.
- If you set todos_as_local_files variable to true, your Terraform apply run should contain local files named TODO 2 ... with testing instructions.
In both cases, you will need to replace the test tool path included there with the path to your installation.
Example commands of the primary testing tool: "Psoxy Test Calls"

Testing Deployments made without Terraform

If you used and approach other than Terraform, or did not directly use our Terraform examples, you may not have the testing examples or the test tool installed on your machine.

In such a case, you can install the test tool manually by following steps 1+2 above, and then can review the documentation on how to use it from your machine.

General Troubleshooting

Specific Data Sources

Error: Attempted to load application default credentials (Google provider authentication failure)

This is related to gcloud not being authenticated (or installed?) in the environment where you're running terraform, which the google terraform provider requires.

If you DO NOT intend to use Google Workspace as a data source, you should do the following:

remove the google-*.tf files from your terraform configuration
remove module/local references from your main.tf file that referred to those files; as of v0.4.53, there are 3 such references you must remove; you will get errors in terraform commands until you remove all of them. The error messages should reference the impacted line numbers.

If you DO intend to use Google Workspace as a data source, you must install and authenticate the gcloud CLI and/or modify the google provider block in google-workspace.tf with your desired authentication details. See:

Specific Host platforms:

General Tips

Verify Pre-Requisites

Our example templates include a script to check for the prerequisites for running the psoxy. You can run this prior to ./init to get feedback/suggestions on what prerequisites you may be missing and how to install them.

General Build / Packaging Failures

Our example Terraform configurations should compile and package the Java code into a JAR file, which is then deployed by Terraform to your host environment.

If, on your first terraform plan/terraform apply, you see the line such as

module.psoxy-aws-msft-365.module.psoxy-aws.module.psoxy-package.data.external.deployment_package: Reading...

And that returns really quickly, something may have gone wrong with the build. You can trigger the build directly by running:

That may give you some clues as to what went wrong.

You can also look for a file called last-build.log in the directory where your Terraform configuration resides.

If you want to go step-by-step, you can run the following commands:

Some problems we've seen:

Maven repository access - the build process must get various dependencies from a remote Maven respository; if your laptop cannot reach Maven Central, is configured to get dependencies from some other Maven repository, etc - you might need to fix this issue. You can check your ~/.m2/settings.xml file, which might give you some insight into what Maven repository you're using. It's also where you'd configure credentials for a private Maven repository, such as Artifactory/etc - so make sure those are correct.

Upgrading Psoxy Code

If you upgrade your psoxy code, it may be worth trying terraform init --upgrade to make sure you have the latest versions of all Terraform providers on which our configuration depends.

By default, terraform locks providers to the version that was the latest when you first ran terraform init. It does not upgrade them unless you explicitly instruct it to. It will not prompt you to upgrade them unless we update the version constraints in our modules.

While we strive to ensure accurate version constraints, and use provider features consistent with these constraints, our automated tests will run with the latest version of each provider. Regretably, we don't currently have a way to test with ALL versions of each provider that satisfy the constraints, or all possible combinations of provider versions.

State Inconsistencies

Often, in response to errors, a second run of terraform apply will work.

If something was actually created in the cloud provider, but Terraform state doesn't reflect it, then try terraform import [resource] [provider-resource-id]. [resource] should be replaced with whatever the path to it is in your terraform configuration, which you can get from the terraform plan output. provider-resource-id is a little trickier, and you might need to find the format required by finding the Terraform docs for the resource type on the web.

NOTE: resources in plan with brackets/quotes will need these escaped with a backslash for use in bash commands.

Unsupported Terraform versions

Errors such as the following on terraform plan?

The solution is to downgrade your Terraform version to one that's supported by our modules (>= 1.3.x, <= 1.7.x as of March 2024).

If you're running Terraform in cloud/CI environment, including Terraform Cloud, GitHub Actions, etc, you can likely explicitly set the desired Terraform version in your workspace settings / terraform setup action.

Configuration

JSON Filter

JSON Filter is inspired by , but with the goal to filter documents rather than validate. As such, the basic idea is that data nodes that do not match the filter schema are removed, rather than the whole document failing validation.

The goal of JsonFilter is that only data elements specified in the filter pass through.

Some differences:

required properties are ignored. While in JSON schema, an object that was missing a "required" property is invalid, objects missing "required" properties in a filter will be preserved.
{ } , eg, a schema without a type, is interpreted as any valid leaf node (eg, unconstrained leaf; everything that's not 'array' or 'object') - rather than any valid JSON.

Compatibility goals:

a valid JSON Schema is convertible to valid JSON filter (with JSON schema features not supported by JSON filter ignored)

Motivations

`{ }` as "any valid leaf node"

compactness, esp when encoding filter as YAML. can put { } instead of { "type": "string" }
flexibility; for filtering use-case, often you just care about which properties are/aren't passed, rather than 'string' vs 'number' vs 'integer'
"any valid JSON" is more common use case in validation than in filtering.

Development

Approaches for Example / Module design

The problem we're trying to solve is that various features, such as VPCs/etc, are relevant to a small set of users. It would complicate the usual cases to enable them for all cases. So we need to provide easy support for extending the examples/modules to support them in the extreme cases.

Composition is the canonical terraform approach.

3 approaches:

composition, which is canonical terraform a. commented out
- validation
- instructions to explain to customers are more complex b. conditional : validation will work, but hacky 0 indexes around in places
conditionals + variables pros:
- simplest for customers
- easiest to read/follow cons:
- verbose interfaces
- brittle stacks (changing variable requires changing many in hierarch)

Java Annotation Proccessing

For core, seems need to use explicit processor path (which IntelliJ fills for you), rather than classpath. And output generated code to Module Content Root, not Module Output directory.

Atlassian

API Data Sanitization

Psoxy supports specifying sanitization rule sets to use to sanitize data from an API. These can be configured by encoding a rule set in YAML and setting a parameter in your instance's configuration. See an example of rules for Zoom: zoom.yaml.

If such a parameter is not set, a proxy instances selects default rules based on source kind, from the corresponding supported source.

You can configure custom rule sets for a given instance via Terraform, by adding an entry to the custom_api_connector_rules map in your terraform.tfvars file.

eg,

custom_api_connector_rules = {
    gmail: "custom-gmail.yaml"
}

API Connector Rules Syntax

<ruleset> ::= "endpoints:" <endpoint-list> <endpoint-list> ::= <endpoint> | <endpoint> <endpoint-list>

A ruleset is a list of API endpoints that are permitted to be invoked through the proxy. Requests which do not match a endpoint in this list will be rejected with a 403 response.

Endpoint Specification

<endpoint> ::= <path-template> <allowed-methods> <path-parameter-schemas> <query-parameter-schemas> <response-schema> <transforms>

<path-template> ::= "- pathTemplate: " <string> Each endpoint is specified by a path template, based on OpenAPI Spec v3.0.0 Path Template syntax. Variable path segments are enclosed in curly braces ({}) and are matched by any value that does not contain an / character.

See: https://swagger.io/docs/specification/paths-and-operations/

Allowed Methods

<allowed-methods> ::= "- allowedMethods: " <method-list> <method-list> ::= <method> | <method> <method-list>

If provided, only HTTP methods included in this list will be permitted for the endpoint. Given semantics of RESTful APIs, this allows an additional point to enforce "read-only" data access, in addition to OAuth scopes/etc.

NOTE: for AWS-hosted deployments using API Gateway, IAM policies and routes may also be used to restrict HTTP methods. See aws/guides/api-gateway.md for more details.

Path / Query Parameter Schemas

<path-parameter-schemas> ::= "- pathParameterSchemas: " <parameter-schema> <query-parameter-schemas> ::= "- queryParameterSchemas: " <parameter-schema>

alpha - a parameter schema to use to validate path/query parameter values; if validation fails, proxy will return 403 forbidden response. Given the use-case of validating URL / query parameters, only a small subset of JSON Schema is supported.

As of 0.4.38, this is considered an alpha feature which may change in backwards-incompatible ways.

Currently, the supports JSON Schema features are:

type :
- string value must be a JSON string
- integer value must be a JSON integer
- number value must be a JSON number
format :
- reversible-pseudonym : value MUST be a reversible pseudonym generated by the proxy
pattern : a regex pattern to match against the value
enum : a list of values to match against the value

null/empty is valid for all types; you can use a pattern to restrict this further.

Response Schema

<response-schema> ::= "responseSchema: " <json-schema-filter>

See: Response Schema Specification below.

<transforms> ::= "transforms:" <transform-list> <transform-list> ::= <transform> | <transform> <transform-list>

For each Endpoint, rules specify a list of transforms to apply to the response content.

Transform Specification

<transform> ::= "- " <transform-type> <json-paths> [<encoding>]

Each transform is specified by a transform type and a list of JSON paths. The transform is applied to all portions of the response content that match any of the JSON paths.

Supported Transform Types:

NOTE: these are implementations of com.avaulta.gateway.rules.transforms.Transform class in the psoxy codebase.

Pseudonymize

!<pseudonymize> - transforms matching values by normalizing them (triming whitespace; if appear to be emails, treating them as case-insensitive, etc) and computing a SHA-256 hash of the normalized value. Relies on SALT value configured in your proxy environment to ensure the SHA-256 is deterministic across time and between sources. In the case of emails, the domain portion is preserved, although the hash is still based on the entire normalized value (avoids hash of alice@acme.com matching hash of alice@beta.com).

Options:

includeReversible (default: false): If true, an encrypted form of the original value will be included in the result. This value, if passed back to the proxy in a URL, will be decrypted back to the original value before the request is forward to the data source. This is useful for identifying values that are needed as parameters for subsequent API requests. This relies on symmetric encryption using the ENCRYPTION_KEY secret stored in the proxy; if ENCRYPTION_KEY is rotated, any 'reversible' value previously generated will no longer be able to be decrypted by the proxy.
encoding (default: JSON): The encoding to use when serializing the pseudonym to a string.
- JSON - a JSON object structure, with explicit fields
- URL_SAFE_TOKEN - a string format that aims to be concise, URL-safe, and format-preserving for email case.

Pseudonymize Email Header

!<pseudonymizeEmailHeader> - transforms matching values by parsing the value as an email header, in accordance with RFC 2822 and some typical conventions, and generating a pseudonym based only on the normalized email address itself (ignoring name, etc that may appear) . In particular:

deals with CSV lists (multiple emails in a single header)
handles the name <email> format, in effect redacting the name and replacing with a pseudonym based only on normalized email

Redact

!<redact> - removes the matching values from the response.

Some extensions of redaction are also supported:

!<redactExceptSubstringsMatchingRegexes> - removes the matching values from the response except value matches one of the specified regex options. (Use case: preserving portions of event titles if match variants of 'Focus Time', 'No Meetings', etc)
!<redactRegexMatches> - redact content IF it matches one of the regexs included as an option.

By using a negation in the JSON Path for the transformation, !<redact> can be used to implement default-deny style rules, where all fields are redacted except those explicitly listed in the JSON Path expression. This can also redact object-valued fields, conditionally based on object properties as shown below.

Eg, the following redacts all headers that have a name value other than those explicitly listed below:

- !<redact>
  jsonPaths:
    - "$.messages.payload.headers[?(!(@.name =~ /^From|To|Cc|Bcc|X-Original-Sender|Delivered-To|Sender|Message-ID|Date|In-Reply-To|Original-Message-ID|References$/i))]"

Tokenize

!<tokenize> - replaces matching values it with a reversible token, which proxy can reverse to the original value using ENCRYPTION_KEY secret stored in the proxy in subsequent requests.

Use case are values that may be sensitive, but are opaque. For example, page tokens in Microsoft Graph API do not have a defined structure, but in practice contain PII.

Options:

regex a capturing regex to use to extract portion of value that needs to be tokenized.

Filter Tokens by Regex

!<filterTokenByRegex> - tokenizes matching string values by a delimiter, if provided; and matches result against a list of filters, removing any content that doesn't match at least one of the filters. (Use case: preserving Zoom URLs in meeting descriptions, while removing the rest of the description)

Options:

delimiter - used to split the value into tokens; if not provided, the entire value is treated as a single token.
filters - in effect, combined via OR; tokens matching ANY of the filters is preserved in the value.

Response Schema Specification

A "response schema" is a "JSON Schema Filter" structure, specifying how response (which must be JSON) should be filtered. Using this, you can implement a "default deny" approach to sanitizing API fields in a manner that may be more convenient than using JSON paths with conditional negations (a redact transform with a JSON path that matches all but an explicit list of named fields is the other approach to implementing 'default deny' style rules).

Our "JSON Schema Filter" implementation attempts to align to the JSON Schema specification, with some variation as it is intended for filtering rather than validation. But generally speaking, you should be able to copy the JSON Schema for an API endpoint from its OpenAPI specification as a starting point for the responseSchema value in your rule set. Similarly, there are tools that can generate JSON Schema from example JSON content, as well as from data models in various languages, that may be useful.

See: https://json-schema.org/implementations.html#schema-generators

If a responseSchema attribute is specified for an endpoint, the response content will be filtered (rather than validated) against that schema. Eg, fields NOT specified in the schema, or not of expected type, will be removed from the response.

type - one of :
- object a JSON object
- array a JSON array
- string a JSON string
- number a JSON number, either integer or decimal.
- integer a JSON integer (not a decimal)
- boolean a JSON boolean
properties - for type == object, a map of field names to schema to filter field's value against (eg, another JsonSchemaFilter for the field itself)
items - for type == array, a schema to filter each item in the array against (again, a JsonSchemaFilter)
format - for type == string, a format to expect for the string value. As of v0.4.38, this is not enforced by the proxy.
$ref - a reference to a schema specified in the definitions property of the root schema.
definitions - a map of schema names to schemas of type JsonSchemaFilter; only supported at root schema of endpoint.

Example:

The following is for a User from the GitHub API, via graphql. See: https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#the-graphql-endpoint

- pathTemplate: "/graphql"
  responseSchema:
    type: "object"
    properties:
      data:
        type: "object"
        properties:
          organization:
            type: "object"
            properties:
              samlIdentityProvider:
                type: "object"
                properties:
                  externalIdentities:
                    type: "object"
                    properties:
                      pageInfo:
                        type: "object"
                        properties:
                          hasNextPage:
                            type: "boolean"
                          endCursor:
                            type: "string"
                      edges:
                        type: "array"
                        items:
                          type: "object"
                          properties:
                            node:
                              type: "object"
                              properties:
                                guid:
                                  type: "string"
                                samlIdentity:
                                  type: "object"
                                  properties:
                                    nameId:
                                      type: "string"
                                user:
                                  type: "object"
                                  properties:
                                    login:
                                      type: "string"
              membersWithRole:
                type: "object"
                properties:
                  pageInfo:
                    type: "object"
                    properties:
                      hasNextPage:
                        type: "boolean"
                      endCursor:
                        type: "string"
                  edges:
                    type: "array"
                    items:
                      type: "object"
                      properties:
                        node:
                          type: "object"
                          properties:
                            email:
                              type: "string"
                            login:
                              type: "string"
                            id:
                              type: "string"
                            isSiteAdmin:
                              type: "boolean"
                            organizationVerifiedDomainEmails:
                              type: "array"
                              items:
                                type: "string"
      errors:
        type: "array"
        items:
          type: "object"
          properties:
            type:
              type: "string"
            path:
              type: "array"
              items:
                type: "string"
            locations:
              type: "array"
              items:
                type: "object"
                properties:
                  line:
                    type: "integer"
                  column:
                    type: "integer"
            message:
              type: "string"