There are two connection legs to consider with regard to authentication and authorization in API mode:
between Worklytics and the proxy (your host cloud)
between the proxy and the data source API
For example, Worklytics initiates an API request to the proxy (1), which, after validating the request, forwards it to the data source API on behalf of Worklytics (2), adding its own authentication information.
Worklytics is authorized to access your proxy instance via an Identity and Access Management (IAM) policy which you must configure in your host platform. The exact details vary by cloud provider:
Worklytics authenticates in all cases via Workload Identity Federation; as your Worklytics tenant is running natively in the cloud, it can leverage the cloud provider's native IAM service to establish identity which can be asserted to other services in the cloud.
Although exact details vary by data source, most utilize some form of OAuth 2.0 for authorization and authentication.
A data source admin (eg, a Google Workspace admin) must authorize the proxy to access the data source via the data source's admin console. This typically involves creating a new OAuth 2.0 client and granting that client a set of oauth scopes required to support the API calls that will be made on behalf of Worklytics. A detailed list of scopes required for each data source is specified in the documentation of each connector.
See https://docs.worklytics.co/psoxy#supported-data-sources
The proxy authenticates itself for calls to the data source using one of the supported OAuth 2.0 mechanisms (see https://oauth.net/2/client-authentication/). Most commonly, these are Client Credentials or Workload Identity Federation.
In particular, a quick overview for common sources:
Microsoft 365 sources authenticate via Workload Identity Federation
Google Workspace sources authenticate via Client Credentials (a GCP Service Account key)
GitHub authenticates via Client Credentials (a GitHub App client id + key)
Jira authenticates via Client Credentials (a Jira App client id + secret)
Slack authenticates via Client Credentials (a Slack App token)
Salesforce authenticates via Client Credentials (a Salesforce App client id + secret)
Zoom authenticates via Client Credentials (a Zoom App client id + secret)
In all cases relying on secrets (a key, client secret, token, etc) to authenticate, these values are stored in the secret store implementation of your Host cloud provider (eg, GCP Secret Manager) and never passed to or accessed by Worklytics. Worklytics has no means to directly connect to any of your data sources.
These shell command examples presume Ubuntu; you may need to translate to your *nix variant. If you are starting with a fairly rich environment, many of these tools may already be on your machine.
install dependencies
install Java + maven (required to build the proxy binary to be deployed)
install Terraform
Follow Terraform's install guide (recommended) or, if you need to manage multiple Terraform versions, use tfenv
if you're deploying in AWS, install the AWS CLI
if you want to test an AWS deployment, install AWS Curl (which requires python 3.6+ and pip)
install pip (likely included with a fresh python install), then use that to install awscurl
if deploying to GCP or using Google Workspace data sources, install Google Cloud CLI and authenticate.
if using Microsoft 365 data sources, install Azure CLI and authenticate.
https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
You should now be ready for the general instructions in the README.md.
This page provides an overview of how Psoxy authenticates and confirms authorization of clients (Worklytics tenants).
For a general overview of how Psoxy is authorized to access data sources, and how it authenticates when making API calls to those sources, see API Mode Authentication and Authorization.
Each Worklytics tenant operates as a unique GCP service account within Google Cloud. GCP issues an identity token for this service account to processes running in the tenant, which the tenant then uses to authenticate against AWS.
This is OIDC based identity federation (aka "web identity federation" or "workload identity federation").
No secrets or keys need to be exchanged between Worklytics and your AWS instance. The integrity of the authentication is provided by the signature of the identity token provided by GCP, which AWS verifies against Google's public certificates.
AWS provides an overview of the specific GCP Case: Access AWS using a Google Cloud Platform native workload identity
Annotating the diagram for the above case, with specific components for Worklytics-->Proxy case:
In the above, the AWS resource you're allowing access to is an AWS IAM role, which your Worklytics tenant assumes; it can then access S3 or invoke Lambda functions.
Within your AWS account, you create an IAM role, with a role assumption policy that allows your Worklytics tenant's GCP Service Account (identified by a numeric ID you obtain from the Worklytics portal) to assume the role.
This assumption policy will have a statement similar to the following, where the value of the aud claim is the numeric ID of your Worklytics tenant's GCP Service Account:
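As a minimal sketch, assuming the standard AWS policy document format for web identity federation with Google, the trust policy could be generated as below. The helper name `build_assume_role_policy` is hypothetical, and the numeric ID is the placeholder used in this page, not a real tenant ID.

```python
import json


def build_assume_role_policy(service_account_numeric_id: str) -> dict:
    """Build a role assumption (trust) policy for a Google-federated web identity.

    Hypothetical helper for illustration; not part of the Psoxy modules.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The identity is federated from Google's OIDC issuer.
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only tokens whose aud condition matches the tenant's ID qualify.
                "Condition": {
                    "StringEquals": {
                        "accounts.google.com:aud": service_account_numeric_id
                    }
                },
            }
        ],
    }


# Placeholder ID; obtain the real value from the Worklytics portal.
policy = build_assume_role_policy("12345678901234567890123456789")
print(json.dumps(policy, indent=2))
```

The printed JSON is what you would review before attaching it to the role, whether you create the role by hand or via Terraform.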
Colloquially, this allows a web identity federated from accounts.google.com where Google has asserted the claim that aud == 12345678901234567890123456789 to assume the role.
Then you use this AWS IAM role as the principal in AWS IAM policies you define to authorize it to invoke your proxy instances via their function URLs (API connectors) or to read from their sanitized output buckets (bulk data connectors).
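A rough sketch of what such a policy might contain is below. It assumes the `lambda:InvokeFunctionUrl` and `s3:GetObject` / `s3:ListBucket` IAM actions are the relevant ones; the function ARN and bucket name are hypothetical, and the policies actually provisioned by the Terraform modules may differ.

```python
import json


def build_caller_policy(function_arns, sanitized_bucket_names):
    """Build an identity policy to attach to the role your Worklytics tenant assumes.

    Hypothetical helper for illustration only.
    """
    statements = []
    if function_arns:
        # API connectors: allow invoking the proxy instances via their function URLs.
        statements.append({
            "Effect": "Allow",
            "Action": "lambda:InvokeFunctionUrl",
            "Resource": list(function_arns),
        })
    for bucket in sanitized_bucket_names:
        # Bulk connectors: allow read-only access to the sanitized output bucket.
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        })
    return {"Version": "2012-10-17", "Statement": statements}


caller_policy = build_caller_policy(
    ["arn:aws:lambda:us-east-1:123456789012:function:psoxy-example"],  # hypothetical
    ["psoxy-example-sanitized"],  # hypothetical
)
print(json.dumps(caller_policy, indent=2))
```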
You'll provision the following to host Psoxy in AWS:
S3 buckets, if using the 'bulk' mode to sanitize file data (such as CSVs); see S3 docs
Cognito Pools and Identities, if connecting to Microsoft 365 data sources
The diagram below provides an architecture overview of the 'API' and 'Bulk' mode use-cases.
An AWS Account in which to deploy Psoxy. We strongly recommend you provision one specifically to host Psoxy, as this will create an implicit security boundary, reduce possible conflicts with other infra configured in the account, and simplify eventual cleanup.
You will need the numeric AWS Account ID for this account, which you can find in the AWS Console.
If your AWS organization enforces Service Control Policies, ensure that these allow the AWS components required by Psoxy, or exempt the AWS Account in which you will deploy Psoxy from these policies.
If your organization uses any sort of security control enforcement mechanism, you may have to disable or provide exceptions to those controls for your initial deployment. Generally, those controls can then be implemented later by extending our examples. Our protips page provides some guidance on how to extend the base examples to meet more extreme requirements.
A sufficiently privileged AWS Role. You must have an IAM Role within the AWS account with sufficient privileges to (AWS managed policy examples linked):
create IAM roles + policies (eg IAMFullAccess)
create and update Systems Manager Parameters (eg, AmazonSSMFullAccess)
create and manage Lambdas (eg, AWSLambda_FullAccess)
create and manage S3 buckets (eg, AmazonS3FullAccess)
create Cloud Watch Log groups (eg, CloudWatchFullAccess)
(Yes, the use of AWS Managed Policies results in a role with many privileges; that's why we recommend you use a dedicated AWS account to host the proxy which is NOT shared with any other use case)
You will need the ARN of this role.
NOTE: if you're connecting to Microsoft 365 (Azure AD) data sources, you'll also need permissions to create AWS Cognito Identity Pools and add Identities to them, such as arn:aws:iam::aws:policy/AmazonCognitoPowerUser. Some AWS Organizations have Service Control Policies in place that deny this by default, even if you have an IAM role that allows it at an account level.
NOTE: using AWS API Gateway, VPC, or Secrets Manager (not used by default in our examples) will require additional permissions beyond the above.
See: protips.md for guide to create a least-privileged iam policy for provisioning.
An authenticated AWS CLI in your provisioning environment. Your environment (eg, shell/etc from which you'll run terraform commands) must be authenticated as an identity that can assume that role. (see next section for tips on options for various environments you can use)
Eg, if your Role is arn:aws:iam::123456789012:role/PsoxyProvisioningRole, the following should work:
To provision AWS infra, you'll need the aws-cli installed and authenticated on the environment where you'll run terraform.
Here are a few options:
Generate an AWS Access Key for your AWS User.
Run aws configure in a terminal on the machine you plan to use, and configure it with the key you generated in step one.
NOTE: this could even be a GCP Cloud Shell, which may simplify auth if you wish to connect your Psoxy instance to Google Workspace as a data source.
If your organization prefers NOT to authorize the AWS CLI on individual laptops and/or outside AWS, provisioning Psoxy's required infra from an EC2 instance may be an option.
provision an EC2 instance (or request that your IT/dev ops team provision one for you). We recommend a micro instance with an 8GB disk, running ubuntu (not Amazon Linux; if you choose that or something else, you may need to adapt these instructions). Be sure to create a PEM key to access it via SSH (unless your AWS Organization/account provides some other ssh solution).
associate the Role above with your instance (see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html)
Whichever environment you choose, follow general prereq installation.
You'll also need a backend location for your Terraform state (such as an S3 bucket). It can be in any AWS account, as long as the AWS role that you'll use to run Terraform has read/write access to it.
See https://developer.hashicorp.com/terraform/language/settings/backends/s3.
Alternatively, you may use a local file system, but this is not recommended for production use - as your Terraform state may contain secrets such as API keys, depending on the sources you connect.
See https://developer.hashicorp.com/terraform/language/settings/backends/local.
The module psoxy-constants is a dependency-free module that provides lists of AWS managed policies, etc needed for bootstrapping an AWS account in which your proxy instances will reside.
Once you've fulfilled the prereqs, including having your terraform deployment environment, backend, and AWS account prepared, we suggest you use our AWS example template repo:
Follow the 'Usage' instructions there to continue.
YMMV; as of June 2023, AWS's 1GB limit on cloud shell persistent storage is too low for real world proxy deployments, which typically require installing the gcloud CLI / Azure CLI to connect to sources.
So use your local machine, or a VM/container elsewhere in AWS (EC2, AWS Cloud9, etc).
clone the repo
add the following lines to your ~/.bashrc. (AWS Cloud Shell preserves only your HOME directory across sessions, so add any commands that modify/install things outside it to your .bashrc)
Then source ~/.bashrc, to execute the above.
install Terraform
if using Microsoft 365 data sources, install Azure CLI and authenticate.
https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
If default NodeJS tooling doesn't work for you, legacy testing tools use python/awscurl, installed via pip. See example below:
Psoxy is a serverless, pseudonymizing, Data Loss Prevention (DLP) layer between Worklytics and your data sources. It acts as a Security / Compliance layer, which you can deploy between your data sources (SaaS tool APIs, Cloud storage buckets, etc) and Worklytics.
Benefits include:
Granular authorization on the API endpoint, parameter, and field-levels to your Sources. Eg, limit Worklytics to calling ONLY an explicit subset of an API's endpoints, with an explicit set of possible parameters, and receiving ONLY a subset of fields in response.
no API keys for your data sources are ever sent or held by Worklytics.
any PII present in your data can be pseudonymized before being sent to Worklytics
sensitive data can be redacted before being sent to Worklytics
Psoxy can be deployed/used in 3 different modes:
API - psoxy sits in front of a data source API. Any call that would normally be sent to the data source API is instead sent to psoxy, which parses the request, validates it / applies ACL, and adds authentication before forwarding to the host API. After the host API responds, psoxy sanitizes the response as defined by its rules before returning the response to the caller. This is an http triggered flow.
Bulk File - psoxy is triggered by files (objects) being uploaded to cloud storage buckets (eg, S3, GCS, etc). Psoxy reads the incoming file, applies one or more sanitization rules (transforms), and writes the result(s) to a destination (usually in a distinct bucket).
Command-line (cli) - psoxy is invoked from the command-line, and is used to sanitize data stored in files on the local machine. This is useful for testing, or for one-off data sanitization tasks. Resulting files can be uploaded to Worklytics via the file upload of its web portal.
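To make the Bulk File flow concrete, here is a minimal sketch of column-level sanitization of a CSV. It is not Psoxy's implementation: the function name, the column rules, and the use of HMAC-SHA256 as the pseudonymization function are all assumptions made for illustration.

```python
import csv
import hashlib
import hmac
import io


def sanitize_csv(text, secret, pseudonymize=(), redact=()):
    """Return a sanitized copy of CSV text.

    Columns in `redact` are dropped; values in `pseudonymize` columns are
    replaced with a keyed hash, so equal inputs map to equal tokens.
    """
    reader = csv.DictReader(io.StringIO(text))
    kept = [col for col in reader.fieldnames if col not in redact]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        clean = {}
        for col in kept:
            value = row[col]
            if col in pseudonymize:
                # Assumed pseudonymization scheme: HMAC-SHA256 keyed by a secret.
                value = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
            clean[col] = value
        writer.writerow(clean)
    return out.getvalue()


# Hypothetical input file contents and rules.
raw = "employee_id,email,department\n1,alice@example.com,Eng\n2,bob@example.com,Sales\n"
sanitized = sanitize_csv(
    raw, b"hypothetical-secret", pseudonymize={"employee_id"}, redact={"email"}
)
print(sanitized)
```

In a deployed bulk connector, the input would come from the input bucket and the output would be written to the sanitized bucket; here both are in-memory strings to keep the sketch self-contained.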
Data transfer via Psoxy provides a layered approach to data protection, with various redundancies against vulnerabilities / misconfigurations to controls implemented at each layer.
Data source API authorization. The API of your data source limits the data which your proxy instance can access. Typically, these limits align to a set of API endpoints that a given authentication credential is authorized to invoke. In some cases, oauth scopes may limit the fields returned in responses from various endpoints.
Host Platform ACL (IAM). Your proxy instances will be hosted in your preferred cloud hosting provider (eg, AWS, GCP) and access restricted per your host's ACL capabilities. Typically, this means only principals (to borrow AWS's parlance, eg users/roles/etc) which you authorize via an IAM policy can invoke your proxy instances. Apart from limiting who can access data via your proxy instance, IAM rules can enforce read-only access to RESTful APIs by limiting the allowed HTTP methods to GET/HEAD/etc.
Proxy-level ACL. Psoxy itself offers a sophisticated set of access restriction rules, including limiting access by:
- HTTP method (eg, limit to GET/HEAD to ensure read-only access)
- API endpoint (eg, limit access to /files/{fileId}/metadata)
- API parameter (eg, allow only page,pageSize as parameters)
Proxy-level response transformation. Psoxy can be configured to sanitize fields in API responses, including:
pseudonymizing/tokenizing fields that include PII or sensitive identifiers
redacting fields containing sensitive information or which aren't needed for analysis
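The response transformation layer can be pictured with a small sketch like the one below. This is an illustration under assumptions, not Psoxy's rule engine: the field names, the rule sets, and the HMAC-SHA256 tokenization are hypothetical.

```python
import hashlib
import hmac


def sanitize_response(record, secret, pseudonymize=frozenset(), redact=frozenset()):
    """Recursively sanitize a JSON-like dict.

    Keys in `redact` are removed; string-convertible values under keys in
    `pseudonymize` are replaced with a keyed hash. Lists are passed through
    unchanged to keep the sketch short.
    """
    clean = {}
    for key, value in record.items():
        if key in redact:
            continue
        if isinstance(value, dict):
            clean[key] = sanitize_response(value, secret, pseudonymize, redact)
        elif key in pseudonymize and value is not None:
            clean[key] = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        else:
            clean[key] = value
    return clean


# Hypothetical API response and rules.
response = {
    "id": "m1",
    "from": {"email": "alice@example.com", "name": "Alice"},
    "subject": "Q3 plans",
    "sentAt": "2023-06-01T12:00:00Z",
}
sanitized_response = sanitize_response(
    response, b"hypothetical-secret", pseudonymize={"email"}, redact={"subject", "name"}
)
print(sanitized_response)
```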
Together, these layers of data protection can redundantly control data access. Eg, you could ensure read-only access to Gmail metadata by:
granting the Gmail metadata-only oauth scope to your instance via the Google Workspace Admin console, instead of the full Gmail API scope
allowing only GET requests to your proxy instance via AWS IAM policy
configuring rules in your proxy instance that allow only GET requests to be sent to the Gmail API via your instances, and only to the /gmail/v1/users/{mailboxId}/messages and /gmail/v1/users/{mailboxId}/messages/{messageId} endpoints
configuring rules in your proxy instance that filter responses to an explicit set of metadata fields from those endpoints
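The proxy-level part of the Gmail example above can be sketched as a simple allow-list check on method, endpoint, and query parameters. This is a hypothetical illustration, not Psoxy's rule format or code; the allowed query parameters in particular are assumptions.

```python
import re
from urllib.parse import parse_qs, urlsplit

ALLOWED_METHODS = {"GET", "HEAD"}

# Endpoint templates from the example above; the parameter sets are assumed.
ENDPOINT_RULES = [
    ("/gmail/v1/users/{mailboxId}/messages", {"maxResults", "pageToken"}),
    ("/gmail/v1/users/{mailboxId}/messages/{messageId}", {"format"}),
]


def _template_to_regex(template):
    """Turn '/a/{id}/b' into a regex in which each {placeholder} matches one path segment."""
    literal_parts = re.split(r"\{[^/{}]+\}", template)
    return re.compile("^" + "[^/]+".join(re.escape(part) for part in literal_parts) + "$")


COMPILED_RULES = [(_template_to_regex(t), params) for t, params in ENDPOINT_RULES]


def is_request_allowed(method, url):
    """Allow a request only if method, path, and every query parameter are allow-listed."""
    if method.upper() not in ALLOWED_METHODS:
        return False
    parts = urlsplit(url)
    query_params = set(parse_qs(parts.query, keep_blank_values=True))
    for pattern, allowed_params in COMPILED_RULES:
        if pattern.match(parts.path):
            return query_params <= allowed_params
    return False


base = "https://gmail.googleapis.com/gmail/v1/users/me"
print(is_request_allowed("GET", base + "/messages?maxResults=10"))
print(is_request_allowed("POST", base + "/messages"))
print(is_request_allowed("GET", base + "/drafts"))
```

Anything not explicitly matched is denied, which mirrors the "explicit subset" framing used throughout this page.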
This example illustrates how the proxy provides data protection across several redundant layers, each provided by different parties. Eg:
you trust Google to correctly implement their oauth scopes and API access controls to limit the access to gmail metadata
you trust AWS to correctly implement their IAM service, enforcing IAM policy to limit data access to the methods and principals you configure.
you trust the Psoxy implementation, which is source-available for your review and testing, to properly implement its specified rules/functionality.
you trust Worklytics to implement its service to not store or process non-metadata fields, even if accessible.
You can verify this trust via the logging provided by your data source (API calls received), your cloud host (eg, AWS cloud watch logs include API calls made via the proxy instance), the psoxy testing tools to simulate API calls and inspect responses, and Worklytics logs.
As of June 2023, the following resources provisioned by Psoxy modules support use of CMEKs:
Lambda function environment variables
SSM Parameters
Cloud Watch Log Groups
S3 Buckets
The psoxy-example-aws example provides a project_aws_key_arn variable that, if provided, will be set as the encryption key for these resources. A few caveats:
The AWS principal your Terraform is running as must have permissions to encrypt/decrypt with the key (it needs to be able to read/write the lambda env, ssm params, etc)
The key should be in the same AWS region you're deploying to.
CloudWatch must be able to use the key, as described in
In example-dev/aws-all/kms-cmek.tf, we provide a bunch of lines that you can uncomment to use encryption on S3 and properly set key policy to support S3/CloudWatch use.
For production use, you should adapt the key policy to your environment and scope as needed to follow your security policies, such as principle of least privilege.
If you need more granular control of CMEK by resource type, review the main.tf and variables exposed by the aws-host module for some options.
- brittle! YMMV.
Some ideas on how to support scenarios and configuration requirements beyond what our default examples show:
see
If you're using our AWS example, it should support a default_tags variable.
You can add the following in your terraform.tfvars file to set tags on all resources created by the example configuration:
If you're not using our AWS example, you will need to modify the aws provider block in your configuration to add a default_tags block. Example shown below:
See: https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags
To support extensibility, our Terraform examples/modules output the IDs/names of the major resources they create, so that you can compose them with other Terraform resources.
The aws-host module outputs bulk_connector_instances; a map of id => instance for each bulk connector. Each of these has two attributes that correspond to the names of its related buckets:
sanitized_bucket_name
input_bucket_name
So in our AWS example, you can use these to enable logging; for example, you could do something like this (YMMV; syntax etc should be tested):
See s3-extra-sec.tf in the example repo from v0.4.58+ for example code you can uncomment and modify.
You can also set bucket-level policies to restrict access to SSL-only, with something like the following:
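As a sketch of what an SSL-only bucket policy document looks like, the snippet below builds one as JSON using the `aws:SecureTransport` condition key. The helper and bucket name are hypothetical; in practice you would likely express the same document in Terraform alongside the example configuration.

```python
import json


def build_ssl_only_bucket_policy(bucket_name):
    """Build a bucket policy that denies any request not made over TLS.

    Hypothetical helper for illustration only.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name; use the sanitized_bucket_name / input_bucket_name outputs.
ssl_policy = build_ssl_only_bucket_policy("psoxy-example-sanitized")
print(json.dumps(ssl_policy, indent=2))
```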
Analogous approaches can be used to configure versioning, replication, etc;
Note that encryption, lifecycle, and public_access_block are set by the Worklytics-provided modules, so you may have conflicts if you also try to set those outside.
beta - released from v0.4.50; YMMV, and may be subject to change.
The terraform modules we provide provision execution roles for each lambda function, and by default attach the appropriate AWS Managed Policy to each.
For organizations that don't allow use of AWS Managed Policies, you can use the aws_lambda_execution_role_policy_arn variable to pass in an alternative which will be used INSTEAD of the AWS Managed Policy.
YMMV, but we expose a minimal IAM policy for provisioning in the psoxy-constants module, which you can attach to your desired role to ensure it has sufficient permissions to provision the proxy.
NOTE: using features beyond the default set, such as AWS API Gateway, VPC, or Secrets Manager, may require some additional permissions beyond what is provided in the least-privileged policy.
beta - we don't commit to maintaining this under our versioning policy; minor proxy iterations may require changes to the privileges required in the least-privileged role.
This is a guide about how to create a role for provisioning psoxy infrastructure in AWS, following the principle of least-privilege at permission-level, rather than policy-level.
Eg, as of v0.4.55 of the proxy, our docs provide guidance on using an AWS role to provision your psoxy infrastructure using the least-privileged set of AWS managed policies possible. A stronger standard would be to use a custom IAM policy rather than AWS managed policy, with the least-privileged set of permissions required.
Additionally, you can specify resource constraints to improve security within a shared AWS account. (However, we do not recommend or officially support deployment into a shared AWS account. We recommend deploying your proxy instances in isolated AWS account to provide an implicit security boundary by default, as an additional layer of protection beyond those provided by our proxy modules)
We provide an example IAM policy document in our psoxy-constants module that you can use to create an IAM policy in AWS. You can do this outside terraform, finding the JSON from that policy OR via terraform as follows:
if using Google Workspace data sources, and authenticate.
You should now be ready for the general instructions in the .
Specifically, this is , unless you're using a VPC - in which case it is AWSLambdaVPCAccessExecutionRole
(https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaVPCAccessExecutionRole.html).
Some organizations require use of API Gateway. This is not the default approach for Psoxy since AWS added support for Lambda Function URLs (March 2022), which are a simpler and more direct way to expose lambdas via HTTPS.
Nonetheless, should you wish to use API Gateway we provide beta support for this. It is needed if you wish to put your Lambda functions on a VPC (See lambdas-on-vpc.md).
In particular:
IAM policy that allows api gateway methods to be invoked by the proxy caller role is defined once, using wildcards, and exposes GET/HEAD/POST methods for all resources. While methods are further constrained by routes and the proxy rules themselves, this could be another enforcement point at the infrastructure level - at expense of N policies + attachments in your terraform plan instead of 1.
proxy instances exposed as lambda function urls have 55s timeout, but API Gateway seems to support 30s as max - so this may cause timeouts in certain APIs
Prerequisites:
the AWS principal (user or role) you're using to run Terraform must have permissions to provision API gateways. The AWS managed policy AmazonAPIGatewayAdministrator provides this.
Add the following to your terraform.tfvars file:
Then terraform apply should create the API gateway-related resources, including policies/etc; and destroy lambda function urls (if you've previously applied with use_api_gateway=false, which is the default).
If you wish to use API Gateway V1, you will not be able to use the flag above. Instead, you'll have to do something like the following:
Additionally, you'll need to set a different handler class to be invoked instead of the default (co.worklytics.psoxy.Handler); it should be co.worklytics.psoxy.APIGatewayV1Handler. This can be done in Terraform or by modifying configuration via AWS Console.
Required:
Optional:
AWS SAM CLI (macOS) for local testing, if desired
awscurl for direct testing of deployed AWS lambda from a terminal
Maven build produces a zip file.
Build core library
From java/impl/aws/:
Locally, you can test function's behavior from invocation on a JSON payload (but not how the API gateway will map HTTP requests to that JSON payload):
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html
We recommend deploying your Psoxy code into AWS using the terraform modules found in [infra/modules/](../../infra/modules/) for AWS. These modules both provision the required AWS infrastructure and deploy the built binaries for Psoxy as lambdas in the target account.
Example configurations using those modules can be found in infra/examples/.
You'll ultimately provision infrastructure represented in green in the following diagram:
![AWS data flow](./2022-02 Psoxy Data Flow.png)
See infra/modules/aws/ for more information.
Tips and tricks for using AWS to host the proxy.
If the above doesn't seem to work as expected, some ideas in the next section may help.
Options:
execute terraform via AWS Cloud Shell
find credentials output by your SSO helper (eg, aws-okta) then fill the AWS CLI env variables yourself:
if your SSO helper fills the default AWS credentials file but simply doesn't set the env vars, you may be able to export the profile to AWS_PROFILE, eg
References: https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7
Options:
execute terraform via AWS Cloud Shell
use a script such as aws-mfa to get short-lived key+secret for your user.
Log into AWS web console
navigate to the AWS account that hosts your proxy instance (you may need to assume a role in that account)
then the region in that account in which your proxy instance is deployed (default us-east-1)
then search or navigate to the AWS Lambdas feature, and find the specific one you wish to debug
find the tabs for Monitoring, then within that, Logging, then click "go to Cloud Watch"
Unless your AWS CLI is auth'd as a user who can review logs, first auth it for such a role.
You can do this with a new profile, or setting env variables as follows:
Then, you can do a series of commands as follows:
Something like the following:
Your Terraform state is inconsistent. Run something like the following, adapted for your connector:
NOTE: you likely need to change outlook-mail if your error is with a different data source. The \ chars are needed to escape the double-quotes/brackets in your bash command.
Something like the following:
Check:
the SSM parameter exists in the AWS account
the SSM parameter can be read by the lambda's execution role (eg, has an attached IAM policy that allows the SSM parameter to be read); you can test this with the AWS Policy Simulator, setting 'Role' to your lambda's execution role, 'Service' to 'AWS Systems Manager', 'Action' to 'Get Parameter' and 'Resource' to the SSM parameter's ARN.
the SSM parameter can be decrypted by the lambda's execution role (if it's encrypted with a KMS key)
Setting IS_DEVELOPMENT_MODE to "true" in the Lambda's Env Vars via the console can enable some additional logging with detailed SSM error messages that will be helpful; but note that some of these errors will be expected in certain configurations.
Our Terraform examples should provide both of the above for you, but worth double-checking.
If those are present, yet the error persists, it's possible that you have some org-level security constraint/policy preventing SSM parameters from being used / read. For example, you may have a "default deny" policy set for SSM GET actions/etc. In such a case, you need to add the execution roles for each lambda as exceptions to such policies (find these under AWS --> IAM --> Roles).
By default, Psoxy uses AWS Systems Manager Parameter Store to store secrets; this simplifies configuration and minimizes costs. However, you may want to use AWS Secrets Manager to store secrets due to organization policy.
In such a case, you can add the following to your terraform.tfvars file:
This will alter the behavior of the Terraform modules so that everything considered a secret is stored in and loaded from AWS Secrets Manager instead of AWS Systems Manager Parameter Store. Note that Parameter Store is still used for non-secret configuration information, such as proxy rules, etc.
Changes will also be made to AWS IAM Policies, to allow lambda function execution roles to access Secrets Manager as needed.
If any secrets are managed outside of Terraform (such as API keys for certain connectors), you will need to grant access to relevant secrets in Secrets Manager to the principals that will manage these.
beta - This is now available for customer-use, but may still change in backwards incompatible ways.
Our aws-host module provides a vpc_config variable to specify the VPC configuration for the lambdas that our Terraform modules will create, analogous to the vpc_config block supported by the AWS lambda terraform resource.
Some caveats:
API connectors on a VPC must be exposed via API Gateway rather than Function URLs (our Terraform modules will make this change for you).
VPC must be configured such that your lambda has connectivity to AWS services including S3, SSM, and CloudWatch Logs; this is typically done by adding a VPC Endpoint for each service.
VPC must allow any API connector to connect to data source APIs via HTTPS (eg 443); usually these APIs are on the public internet, so this means egress to public internet.
VPC must allow your API gateway to connect to your lambdas.
The requirements above MAY require you to modify your VPC configuration, and/or the security groups to support proxy deployment. The example we provide in our vpc.tf should fulfill this if you adapt it; or you can use it as a reference to adapt your existing VPC.
To put the lambdas created by our terraform example under a VPC, please follow one of the approaches documented in the next sections.
If you have an existing VPC, you can use it with the vpc_config variable by hard coding the ids of the pre-existing resources (provisioned outside the scope of your proxy's terraform configuration).
vpc.tf
If you don't have a pre-existing VPC you wish to use, our aws example repo includes a vpc.tf file at the top-level. This file has a bunch of commented-out terraform resource blocks that can serve as examples for creating the minimal VPC + associated infra. Review and uncomment to meet your use-case.
Prerequisites:
the AWS principal (user or role) you're using to run Terraform must have permissions to manage VPCs, subnets, and security groups. The AWS managed policy AmazonVPCFullAccess provides this.
all pre-requisites for the api-gateways (see api-gateway.md)
NOTE: if you provide vpc_config, the value you pass for use_api_gateway_v2 will be ignored; using a VPC requires API Gateway v2, so the value of this flag will be overridden to true.
Add the following to "psoxy" module in your main.tf (or uncomment if already present):
Uncomment the relevant lines in vpc.tf in the same directory, and modify as you wish. This file pulls the default VPC/subnet/security group for your AWS account under terraform.
Alternatively, you can modify vpc.tf to provision a non-default VPC/subnet/security group, and reference those from your main.tf, subject to the caveats above.
See the following terraform resources that you'll likely need:
Check your Cloud Watch logs for the lambda. Proxy lambda will time out in INIT phase if SSM Parameter Store or your secret store implementation (AWS Secrets Manager, Vault) is not reachable.
Some potential causes of this:
DNS failure - it's going to look up the SSM service by domain; if the DNS zone for the SSM endpoint you've provisioned is not published on the VPC, this will fail; similarly, if the endpoint wasn't configured on a subnet - then it won't have an IP to be resolved.
if the IP is resolved, you should see failure to connect to it in the logs (timeouts); check that your security groups for lambda/subnet/endpoint allow bidirectional traffic necessary for your lambda to retrieve data from SSM via the REST API.
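The two failure modes above (DNS resolution vs. connection) can be told apart with a small probe. The following is a minimal sketch, not part of the proxy or its tooling; the function name check_endpoint is hypothetical. To be meaningful, it would need to run from inside the same VPC/subnet as the lambda (for example, from a temporary instance), since results from your laptop say nothing about the VPC's DNS or security groups.

```python
import socket


def check_endpoint(host: str, port: int = 443, timeout: float = 3.0) -> dict:
    """Resolve host, then attempt a TCP connection; report which stage failed."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as err:
        # DNS failure: the name is not published/resolvable from this network
        return {"ok": False, "stage": "dns", "detail": str(err)}

    address = infos[0][4][0]  # first resolved IP address
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return {"ok": True, "stage": "connect", "detail": address}
    except OSError as err:
        # Resolved, but unreachable: check security groups / routing
        return {"ok": False, "stage": "connect", "detail": f"{address}: {err}"}


# Example (assumed region; substitute the one your lambda runs in):
# print(check_endpoint("ssm.us-east-1.amazonaws.com"))
```

A result with stage "dns" points at the endpoint's DNS configuration; a result with stage "connect" and ok False points at security groups or routing.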
Terraform with aws provider doesn't seem to play nice with lambdas/subnets; the subnet can't be destroyed w/o destroying the lambda, but terraform seems unaware of this and will just wait forever.
So:
destroy all your lambdas (terraform state list | grep aws_lambda_function
; then terraform destroy --target=
for each, remember '' as needed)
destroy the subnet terraform destroy --target=aws_subnet.main
https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html
https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
Tips and tricks for using GCP to host the proxy.
Some orgs have policies that block authentication of the GCloud CLI client, requiring you to contact your IT team and have it added to an approved list. Apart from that, there are several possibilities:
use the GCP Cloud Shell (via GCP web console). gcloud
is pre-installed and pre-authorized as your Google user in the Cloud Shell.
use a VM in GCP Compute Engine, with the VM running as a sufficiently privileged service account. In such a scenario, gcloud
will be pre-authenticated by GCP on the VM as that service account.
create credentials within the project itself:
enable IAM API and Cloud Resource Manager API within the project
create OAuth credentials for a 'desktop application' within the target GCP project
download the client-secrets.json
file to your environment
run gcloud auth application-default login --client-id-file=/path/to/client-secrets.json
Terraform relies on GCP's REST APIs for its operations. If these APIs are disabled in either the target project OR the project in which the identity (service account, OAuth client) under which you're running Terraform resides, you may get an error.
The solution is to enable APIs via the Cloud Console, specifically:
IAM API
Cloud Resource Manager API
If some resources seem to not be properly provisioned, try terraform taint
or terraform state rm
, to force re-creation. Use terraform state list | grep
to search for specific resource ids.
If you receive an error such as:
You may need to define an exception for the GCP project in which you're deploying the proxy, or add the domain of your Worklytics Tenant SA to the list of allowed domains.
If you use Psoxy to send pseudonymized data to Worklytics and later wish to re-identify the data that you export from Worklytics to your premises, you'll need a lookup table in your data warehouse to JOIN with that data.
Our aws-host
Terraform module, as used in our Psoxy AWS Example, provides a variable lookup_table_builders
to control generation of these lookup tables.
Populating this variable will generate another version of your HRIS data (aside from the one exposed to Worklytics) which you can then import back to your data warehouse.
To enable it, add the following to your terraform.tfvars
file:
In sanitized_accessor_role_names
, add the name of whatever AWS role that the principal running ingestion of your lookup table from S3 to your data warehouse will assume. You can add additional role names as needed. Alternatively, you can use an IAM policy created outside of our Terraform module to grant access to the lookup table CSVs within the S3 bucket.
After you apply this configuration, the lookup table will be generated in an S3 bucket. The S3 bucket will be shown in the Terraform output:
Use the bucket name shown in your output to build an import pipeline to your data warehouse.
If your input file follows the standard HRIS schema for Worklytics, it will have SNAPSHOT,EMPLOYEE_ID,EMPLOYEE_EMAIL,JOIN_DATE,LEAVE_DATE,MANAGER_ID
columns, at minimum.
Every time a new hris snapshot is uploaded to the hris -input
bucket, TWO copies of it will be created: a sanitized copy in the bucket accessible to Worklytics, and the lookup variant in the lookup bucket referenced above (not accessible to Worklytics).
The lookup table CSV file will have the following columns: EMPLOYEE_ID,EMPLOYEE_ID_ORIG
If you load this into your Data Warehouse, you can JOIN it with the data you export from Worklytics.
Eg, assuming you've exported the Worklytics Weekly aggregates data set to your data warehouse, load the files from S3 bucket above into a table named lookup_hris
.
Then the following query will give re-identified aggregate data:
The employeeId
column in the result set will be the original employee ID from your HRIS system.
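As an illustration of the join described above, here is a minimal Python sketch that re-identifies exported rows using the lookup table's EMPLOYEE_ID and EMPLOYEE_ID_ORIG columns. The sample values, the week and meetingHours columns, and the function names are hypothetical; in practice you would likely do this as a SQL JOIN in your data warehouse.

```python
import csv
import io


def load_lookup(lookup_csv: str) -> dict:
    """Map pseudonymized EMPLOYEE_ID -> original EMPLOYEE_ID_ORIG."""
    reader = csv.DictReader(io.StringIO(lookup_csv))
    return {row["EMPLOYEE_ID"]: row["EMPLOYEE_ID_ORIG"] for row in reader}


def reidentify(rows, lookup, id_column="employeeId"):
    """Inner join: keep only rows whose pseudonym appears in the lookup table."""
    result = []
    for row in rows:
        original = lookup.get(row[id_column])
        if original is None:
            continue  # no match in lookup table; row stays out of the result
        result.append({**row, id_column: original})
    return result


# Hypothetical sample data, for illustration only
LOOKUP_CSV = "EMPLOYEE_ID,EMPLOYEE_ID_ORIG\npseudo-a,1001\npseudo-b,1002\n"
AGGREGATES = [
    {"employeeId": "pseudo-a", "week": "2023-01-02", "meetingHours": "4.5"},
    {"employeeId": "pseudo-z", "week": "2023-01-02", "meetingHours": "2.0"},
]

reidentified = reidentify(AGGREGATES, load_lookup(LOOKUP_CSV))
```

Here reidentified holds one row, with employeeId replaced by the original value 1001; the unmatched pseudonym is dropped, as an inner JOIN would do.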
If your HRIS employee ID column is considered PII, then the lookup table and any re-identified data exports you use it to produce should be handled as Personal data, according to your policies, as these now reference readily identifiable Natural Persons.
If you wish to limit re-identification to a subset of your data, you can use additional columns present in your HRIS csv to do so, for example:
Within the lookup_table_builders
map, you can specify the following fields:
input_connector_id
- usually hris
; this corresponds to whatever bulk connector you want to build the lookup table for.
rules
- this follows the rules structure for the bulk connector case. The example above is suited for HRIS data following the schema expected by Worklytics. If you modify this, be sure to review our documentation or contact support to ensure you don't break your lookup table.
A serverless, pseudonymizing, DLP layer between Worklytics and the REST API of your data sources.
Psoxy replaces PII in your organization's data with hash tokens to enable Worklytics's analysis to be performed on anonymized data which we cannot map back to any identifiable individual.
Psoxy is a pseudonymization service that acts as a Security / Compliance layer, which you can deploy between your data sources (SaaS tool APIs, Cloud storage buckets, etc) and the tools that need to access those sources.
Psoxy ensures more secure, granular data access than direct connections between your tools will offer - and enforces access rules to fulfill your Compliance requirements.
Psoxy functions as an API-level Data Loss Prevention (DLP) layer, by blocking sensitive fields / values / endpoints that would otherwise be exposed when you connect a data source's API to a 3rd party service. It can ensure that data which would otherwise be exposed to a 3rd party service, due to the granularity of source API models/permissions, is not accessed by or transferred to the service.
Objectives:
serverless - we strive to minimize the moving pieces required to run psoxy at scale, keeping your attack surface small and operational complexity low. Furthermore, we define infrastructure-as-code to ease setup.
transparent - psoxy's source code is available to customers, to facilitate code review and white box penetration testing.
simple - psoxy's functionality will focus on performing secure authentication with the 3rd party API and then perform minimal transformation on the response (pseudonymization, field redaction) to ease code review and auditing of its behavior.
Psoxy may be hosted in Google Cloud or AWS.
Psoxy instances reside on your premises (in the cloud) and act as an intermediary between Worklytics and the data source you wish to connect. In this role, the proxy performs the authentication necessary to connect to the data source's API and then any required transformation (such as pseudonymization or redaction) on the response.
Orchestration continues to be performed on the Worklytics side.
Source API data may include PII such as:
But Psoxy ensures Worklytics only sees:
These pseudonyms leverage SHA-256 hashing / AES encryption, with salt/keys that are known only to your organization and never transferred to Worklytics.
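To make the idea of salted hashing concrete, here is a minimal sketch. It is NOT the proxy's actual algorithm: the canonicalization (trim and lowercase), the way the salt is combined, and the URL-safe base64 encoding are all assumptions chosen for illustration. It only shows why the same identifier always yields the same token, while the token cannot be reproduced without the salt.

```python
import base64
import hashlib


def pseudonymize(identifier: str, salt: str) -> str:
    """Illustrative salted SHA-256 pseudonym (assumed scheme, not Psoxy's own)."""
    canonical = identifier.strip().lower()  # assumed canonicalization
    digest = hashlib.sha256((canonical + salt).encode("utf-8")).digest()
    # URL-safe base64 without padding keeps the token compact
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


token_a = pseudonymize("Alice@Example.com", salt="org-secret-salt")
token_b = pseudonymize("alice@example.com ", salt="org-secret-salt")
token_c = pseudonymize("alice@example.com", salt="different-salt")
```

In this sketch token_a and token_b are equal, because both inputs canonicalize to the same string, while token_c differs because the salt differs.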
Psoxy enforces that Worklytics can only access API endpoints you've configured (principle of least privilege) using HTTP methods you allow (eg, limit to GET
to enforce read-only for RESTful APIs).
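The endpoint and method restriction can be pictured as an allow-list check performed before a request is forwarded. The sketch below is a simplified illustration under assumed names (EndpointRule, is_allowed) and an assumed regex-based rule shape; it does not reproduce the proxy's real rule format.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointRule:
    """Hypothetical rule: a path pattern plus the HTTP methods allowed on it."""
    path_regex: str
    allowed_methods: frozenset = frozenset({"GET"})


def is_allowed(method: str, path: str, rules) -> bool:
    """Permit a request only if some rule matches both its path and method."""
    for rule in rules:
        if re.fullmatch(rule.path_regex, path) and method.upper() in rule.allowed_methods:
            return True
    return False  # default deny: anything not explicitly configured is blocked


# Illustrative rule for a calendar events endpoint (read-only)
RULES = [EndpointRule(r"/calendar/v3/calendars/[^/]+/events")]
```

With these rules, a GET to /calendar/v3/calendars/primary/events passes, while a POST to the same path or a GET to any other path is rejected.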
For data source APIs which require keys/secrets for authentication, such values remain stored on your premises and are never accessible to Worklytics.
You authorize your Worklytics tenant to access your proxy instance(s) via the IAM platform of your cloud host.
Worklytics authenticates your tenant with your cloud host via Workload Identity Federation. This eliminates the need for any secrets to be exchanged between your organization and Worklytics, or the use of any API keys/certificates for Worklytics which you would need to rotate.
See also: API Data Sanitization
As of March 2023, the following sources can be connected to Worklytics via psoxy.
Note: Some sources require specific licenses to transfer data via the APIs/endpoints used by Worklytics, or impose some per API request costs for such transfers.
For all of these, a Google Workspace Admin must authorize the Google OAuth client you provision (with provided terraform modules) to access your organization's data. This requires a Domain-wide Delegation grant with a set of scopes specific to each data source, via the Google Workspace Admin Console.
If you use our provided Terraform modules, specific instructions that you can pass to the Google Workspace Admin will be output for you.
Google Calendar
calendar.readonly
Google Chat
admin.reports.audit.readonly
Google Directory
admin.directory.user.readonly admin.directory.user.alias.readonly admin.directory.domain.readonly admin.directory.group.readonly admin.directory.group.member.readonly admin.directory.orgunit.readonly
Google Drive
drive.metadata.readonly
GMail
gmail.metadata
Google Meet
admin.reports.audit.readonly
NOTE: the above scopes are copied from infra/modules/worklytics-connector-specs. Please refer to that module for a definitive list.
NOTE: the 'Google Directory' connection is a required prerequisite for all other Google Workspace connectors.
NOTE: you may need to enable the various Google Workspace APIs within the GCP project in which you provision the OAuth Clients. If you use our provided terraform modules, this is done automatically.
NOTE: the above OAuth scopes omit the https://www.googleapis.com/auth/
prefix. See OAuth 2.0 Scopes for Google APIs for details of scopes.
See details: sources/google-workspace/README.md
For all of these, a Microsoft 365 Admin (at minimum, a Privileged Role Administrator) must authorize the Azure Application you provision (with provided terraform modules) to access your Microsoft 365 tenant's data with the scopes listed below. This is done via the Azure Portal (Active Directory). If you use our provided Terraform modules, specific instructions that you can pass to the Microsoft 365 Admin will be output for you.
Entra ID (formerly Active Directory)
Calendar
Teams (beta)
NOTE: the above scopes are copied from infra/modules/worklytics-connector-specs. Please refer to that module for a definitive list.
NOTE: usage of the Microsoft Teams APIs may be billable, depending on your Microsoft 365 licenses and level of Teams usage. Please review: Payment models and licensing requirements for Microsoft Teams APIs
See details: sources/microsoft-365/README.md
These sources will typically require some kind of "Admin" within the tool to create an API key or client, grant the client access to your organization's data, and provide you with the API key/secret which you must provide as a configuration value in your proxy deployment.
The API key/secret will be used to authenticate with the source's REST API and access the data.
Asana
GitHub
Read Only permissions for: Repository: Contents, Issues, Metadata, Pull requests Organization: Administration, Members
Jira Cloud
"Classic Scopes": read:jira-user
read:jira-work
"Granular Scopes": read:group:jira
read:user:jira
"User Identity API" read:account
Jira Server / Data Center
Personal Access Token on behalf of a user with access to the equivalent of the above scopes for the entire instance
Salesforce
api
chatter_api
refresh_token
offline_access
openid
lightning
content
cdp_query_api
Slack
discovery:read
Zoom
meeting:read:past_meeting:admin
meeting:read:meeting:admin
meeting:read:list_past_participants:admin
meeting:read:list_past_instances:admin
meeting:read:list_meetings:admin
meeting:read:participant:admin
report:read:list_meeting_participants:admin
report:read:meeting:admin
report:read:user:admin
user:read:user:admin
user:read:list_users:admin
NOTE: the above scopes are copied from infra/modules/worklytics-connector-specs. Please refer to that module for a definitive list.
Other data sources, such as Human Resource Information System (HRIS), Badge, or Survey data can be exported to a CSV file. The "bulk" mode of the proxy can be used to pseudonymize these files by copying/uploading the original to a cloud storage bucket (GCS, S3, etc), which will trigger the proxy to sanitize the file and write the result to a 2nd storage bucket, which you then grant Worklytics access to read.
Alternatively, the proxy can be used as a command line tool to pseudonymize arbitrary CSV files (eg, exports from your HRIS), in a manner consistent with how a psoxy instance will pseudonymize identifiers in a target REST API. This is REQUIRED if you want SaaS accounts to be linked with HRIS data for analysis (eg, Worklytics will match email set in HRIS with email set in SaaS tool's account so these must be pseudonymized using an equivalent algorithm and secret). See java/impl/cmd-line/
for details.
See also: Bulk File Sanitization
The prerequisites and dependencies you will need for Psoxy are determined by:
Where you will host psoxy? eg, Amazon Web Services (AWS), or Google Cloud Platform (GCP)
Which data sources you will connect to? eg, Microsoft 365, Google Workspace, Zoom, etc, as defined in previous sections.
Once you've gathered that information, you can identify the required software and permissions in the next section, and the best environment from which to deploy Psoxy.
At a high-level, you need 3 things:
a cloud host platform account to which you will deploy Psoxy (eg, AWS account or GCP project)
an environment on which you will run the deployment tools (usually your laptop)
some way to authenticate that environment with your host platform as an entity with sufficient permissions to perform the deployment. (usually an AWS IAM Role or a GCP Service Account, which your personal AWS or Google user can assume).
You, or the IAM Role / GCP Service Account you use to deploy Psoxy, usually do NOT need to be authorized to access or manage your data sources directly. Data access permissions and steps to grant those vary by data source and generally require action to be taken by the data source administrator AFTER you have deployed Psoxy.
As of Feb 2023, Psoxy is implemented with Java 11 and built via Maven. The proxy infrastructure is provisioned and the Psoxy code deployed using Terraform, relying on Azure, Google Cloud, and/or AWS command line tools.
You will need all the following in your deployment environment (eg, your laptop):
Git: 2.17+ (check with git --version)
Maven: 3.6+ (check with mvn -v)
Java: 11, 17, 21 (see notes) (check with mvn -v | grep Java)
Terraform: 1.3+, <= 1.9 (check with terraform version)
NOTE: we will support Java versions for duration of official support windows, in particular the LTS versions. As of Nov 2023, we still support java 11 but may end this at any time. Minor versions, such as 12-16, and 18-20, which are out of official support, may work but are not routinely tested.
NOTE: Using terraform
is not strictly necessary, but it is the only supported method. You may provision your infrastructure via your host's CLI, web console, or another infrastructure provisioning tool, but we don't offer documentation or support in doing so. Adapting one of our terraform examples or writing your own config that re-uses our modules will simplify things greatly.
NOTE: Avoid Terraform 1.4.x versions earlier than v1.4.3. We've seen bugs.
NOTE: from v0.4.59, we've relaxed Terraform version constraint on our modules to allow up to 1.9.x. However, we are not officially supporting this, as we strive to maintain compatibility with both OpenTofu and Terraform.
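The Terraform version constraints in the notes above (1.3 or later, at most 1.9.x, and not a 1.4.x release below 1.4.3) can be expressed as a small check. This is an illustrative sketch with hypothetical function names; it is not the logic of tools/check-prereqs.sh.

```python
def parse_version(text: str) -> tuple:
    """Turn 'v1.5.7' or '1.6.0-beta1' into a tuple of ints, e.g. (1, 5, 7)."""
    core = text.strip().lstrip("v").split("-")[0]  # drop 'v' prefix and pre-release tag
    return tuple(int(part) for part in core.split(".")[:3])


def terraform_version_ok(version: str) -> bool:
    """Apply the documented constraints: 1.3+, <= 1.9.x, and not 1.4.0 to 1.4.2."""
    parsed = parse_version(version)
    major_minor = parsed[:2]
    if major_minor < (1, 3) or major_minor > (1, 9):
        return False
    if major_minor == (1, 4) and parsed < (1, 4, 3):
        return False  # 1.4.x releases before 1.4.3 had bugs
    return True
```

For example, terraform_version_ok("1.4.2") is False, while terraform_version_ok("v1.4.3") and terraform_version_ok("1.9.8") are True.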
Depending on your Cloud Host / Data Sources, you will need:
if deploying to AWS
aws --version
if deploying to GCP
gcloud version
if connecting to Microsoft 365
az --version
if connecting to Google Workspace
gcloud version
For testing your psoxy instance, you will need:
Node.js: 16+ (ideally, an LTS version) (check with node --version)
npm: 8+ (check with npm --version)
NOTE: Node.js v16 is unmaintained since Oct 2023, so we recommend a newer version: v18, v20. Some Node.js versions (e.g. v21) may display warning messages when running the test scripts.
We provide a script to check these prereqs, at tools/check-prereqs.sh
. That script has no dependencies itself, so should be able to run on any plain POSIX-compliant shell (eg, bash
, zsh
, etc) that we'd expect you to find on most Linux, MacOS, or even Windows with Subsystem for Linux (WSL) platforms.
Choose the cloud platform you'll deploy to, and follow its 'Getting Started' guide:
Based on that choice, pick from the example template repos below. Use your chosen option as a template to create a new GitHub repo, or if you're not using GitHub Cloud, create a clone/fork of the chosen option in your source control system:
AWS - https://github.com/Worklytics/psoxy-example-aws
GCP - https://github.com/Worklytics/psoxy-example-gcp
You will make changes to the files contained in this repo as appropriate for your use-case. These changes should be committed to a repo that is accessible to other members of your team who may need to support your Psoxy deployment in the future.
Pick the location from which you will deploy (provision) the psoxy instance. This location will need the software prereqs defined in the previous section. Some suggestions:
your local machine; if you have the prereqs installed and can authenticate it with your host platform (AWS/GCP) as a sufficiently privileged user/role, this is a simple option
Google Cloud Shell - if you're using GCP and/or connecting to Google Workspace, this option simplifies authentication. It includes the prereqs above EXCEPT aws/azure CLIs out-of-the-box.
Terraform Cloud - this works, but adds the complexity of authenticating it with your host platform (AWS/GCP)
Ubuntu Linux VM/Container - we provide some setup instructions covering prereq installation for Ubuntu variants of Linux, and specific authentication help for:
Follow the 'Setup' steps in the READMEs of those repos, ultimately running terraform apply
to deploy your Psoxy instance(s).
follow any TODO
instructions produced by Terraform, such as:
provision API keys / make OAuth grants needed by each Data Connection
create the Data Connection from Worklytics to your psoxy instance (Terraform can provide TODO
file with detailed steps for each)
Various test commands are provided in local files, as the output of the Terraform; you may use these examples to validate the performance of the proxy. Please review the proxy behavior and adapt the rules as needed. Customers needing assistance adapting the proxy behavior for their needs can contact support@worklytics.co
Java
Terraform Examples
Tools
Review release notes in GitHub.
Psoxy is maintained by Worklytics, Co. Support as well as professional services to assist with configuration and customization are available. Please contact sales@worklytics.co for more information or visit www.worklytics.co.
clone the repo (or a of it)
if using Microsoft 365 sources, install and authenticate Azure CLI
https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
if deploying AWS infra, install and authenticate AWS CLI
You should now be ready for the general instructions in the .
You'll provision infrastructure that ultimately looks as follows:
This includes:
Cloud Functions
Service Accounts
Secret Manager Secrets, to hold pseudonymization salt, encryption keys, and data source API keys
Cloud Storage Buckets (GCS), if using psoxy to sanitize bulk file data, such as CSVs
NOTE: if you're connecting to Google Workspace as a data source, you'll also need to provision Service Account Keys and activate Google Workspace APIs.
a Google Project
we recommend a dedicated GCP project for your deployment, to provide an implicit security boundary around your infrastructure as well as simplify monitoring/cleanup
a GCP (Google) user or Service Account with permissions to provision Service Accounts, Secrets, Storage Buckets, Cloud Functions, and enable APIs within that project. eg:
additional APIs enabled in the project: (using the Service Usage API
above, our Terraform will attempt to enable these, but as there is sometimes a delay of a few minutes in activation, and in some cases they are required to read your existing infra prior to apply, you may experience errors). To pre-empt those, we suggest ensuring the following are enabled:
You'll also need a secure backend location for your Terraform state (such as a GCS or S3 bucket). It need not be in the same host platform/project/account to which you are deploying the proxy, as long as the Google/AWS user you are authenticated as when running Terraform has permissions to access it.
Some options:
GCS : https://developer.hashicorp.com/terraform/language/settings/backends/gcs
S3 : https://developer.hashicorp.com/terraform/language/settings/backends/s3
Alternatively, you may use a local file system, but this is not recommended for production use - as your Terraform state may contain secrets such as API keys, depending on the sources you connect.
See: https://developer.hashicorp.com/terraform/language/settings/backends/local
The https://github.com/Worklytics/psoxy-example-gcp repo provides an example configuration for hosting proxy instances in GCP. You can use that template, following its Usage
docs to get started.
the 'Service Account' approach described in the prerequisites is preferable to giving a Google user account IAM roles to administer your infrastructure directly. You can pass this Service Account's email address to Terraform by setting the gcp_terraform_sa_account_email
. Your machine/environment's CLI must be authenticated as a GCP entity which can impersonate this Service Account, and likely create tokens as it (Service Account Token Creator
role).
using a dedicated GCP project is superior to using a shared project, as it provides an implicit security boundary around your infrastructure as well as simplifying monitoring/cleanup. The IAM roles specified in the prerequisites must be granted at the project level, so any non-Proxy infrastructure within the GCP project that hosts your proxy instances will be accessible to the user / service account who's managing the proxy infrastructure.
This page provides an overview of how psoxy authenticates and confirms authorization of clients (Worklytics tenants) to access data for GCP-hosted deployments.
For general overview of how Psoxy is authorized to access data sources, and authenticates when making API calls to those sources, see .
As Worklytics tenants run inside GCP, they are implicitly authenticated by GCP. No secrets or keys need be exchanged between your Worklytics tenant and your Psoxy instance. GCP can verify the identity of requests from Worklytics to your instance, just as it does between any process and resource within GCP.
Invocations of your proxy instances are authorized by the IAM policies you define in GCP. For API connectors, you grant the Cloud Function Invoker role to your Worklytics tenant's GCP service account on the Cloud Function for your instance.
For the bulk data case, you grant the Storage Object Viewer role to your Worklytics tenant's GCP service account on the sanitized output bucket for your connector.
You can obtain the identity of your Worklytics tenant's GCP service account from the Worklytics portal.
The apply (java, maven, etc).
With those, you can run locally via IntelliJ, using run configs (located in .idea/runConfigurations
):
package install core
builds the core JAR, on which implementations depend
gcp - run gmail
builds and runs a local instance for GMail
Or from command line:
By default, that serves the function from http://localhost:8080.
1.) run terraform init
and terraform apply
from infra/dev-personal
to provision environment
2.) run locally via IntelliJ run config
3.) execute the following to verify your proxy is working OK
Health check (verifies that your client can reach and invoke the proxy at all, and that it has sensible config)
Using a message id you grab from that:
1.) deploy to GCP using Terraform (see infra/
). Follow steps in any TODO files it generates.
2.) Set your env vars (these should be in a TODO file generated by Terraform in the previous step):
3.) grant yourself access (probably not needed if you have primitive role in project, like Owner or Editor)
alternatively, you can add Terraform resource for this to your Terraform config, and apply it again:
Either way, if this function is for prod use, please remove these grants after you're finished testing.
4.) invocation examples
This may be due to an Organization Policy that restricts the domains that can be used in IAM policies. See https://cloud.google.com/resource-manager/docs/organization-policy/restricting-domains
a (provides full access to Workspace)
2.2+
see
1.0+
see
2.29+
1.0+
see
(should come with node
)
- proxy instances are deployed as GCP cloud functions
- processing of bulk data (such as HRIS exports) uses GCS buckets
- create custom roles for the proxy, to follow principle of least privilege
- your API keys and pseudonymization salt are stored in Secret Manager
- admin Service Accounts that personify Cloud Functions or are used as Google Workspace API connections
- you will need to enable various GCP APIs
the following APIs enabled in the project: (via )
(iamcredentials.googleapis.com
) - generally needed to support authenticating Terraform. May not be needed if you're running terraform
within a GCP environment.
(serviceusage.googleapis.com
)
(compute.googleapis.com
)
(cloudbuild.googleapis.com
)
(cloudfunctions.googleapis.com
)
(cloudresourcemanager.googleapis.com
)
(iam.googleapis.com
)
(secretmanager.googleapis.com
)
(storage-api.googleapis.com
)
For some help in bootstrapping a GCP environment, see also:
The module is a dependency-free module that provides lists of GCP roles, etc needed for bootstrapping a GCP project in which your proxy instances will reside.
Node.js testing tool for Worklytics Psoxy.
We provide a collection of Node.js scripts to help you test your Worklytics Psoxy deploy. The requirements to be able to run the scripts are Node.js (version >=16) and npm (version >=8). First of all, install the npm dependencies: npm i
.
The primary tool is a command line interface (CLI) script that allows you to execute "Psoxy Test Calls" to your Worklytics Psoxy instance. Check all the available options by running node cli-call.js -h
(*).
We also provide a script to test "Psoxy bulk instances": they consist of an input bucket, an output one, and the Psoxy instance itself. The script allows you to upload a comma-separated values (CSV) file to the input bucket; it will then check that the Psoxy has processed the file and has written it to the output bucket, removing all Personally Identifiable Information (PII) from the file (as per Psoxy rules). Check available options by running node cli-file-upload.js -h
(*).
A third script lets you check your Psoxy instance logs: node cli-logs.js -h
(*).
(*) Options may vary depending on whether you've deployed the Worklytics Psoxy to Amazon Web Services (AWS) or Google Cloud Platform (GCP).
Assuming that you've successfully deployed the Psoxy to AWS, and you've configured Google Calendar as data source, let's see an example:
The -r
option is mandatory for AWS deploys, and identifies the Amazon Resource Name (ARN) of the "role" that will be assumed (*) to be able to execute the call. The -u
option is the URL you want to test. In this case, the URL's path matches a Google Calendar API endpoint (access the primary calendar of the currently logged-in user). The -i
option identifies the user "to impersonate"; this option is only relevant for Google Workspace data sources.
Another example for Zoom:
As you can see, the differences are:
As this is not a Google Workspace data source, you don't need the -i
option.
The URL's path matches a Zoom API endpoint in this case
(*) Requests to the AWS API need to be signed, so you must ensure that the machine running these scripts has the appropriate AWS credentials for the role you've selected.
For GCP, every call needs an "identity token" (-t, --token
option in the examples below) for the account that has access to the Cloud Platform (*). If you omit the token, the script will try to get it automatically, so you must authorize gcloud first.
Google Calendar example:
Zoom example:
Outlook Calendar example (token option omitted):
(*) You can obtain it by running gcloud auth print-identity-token
(using Google Cloud SDK)
Use the --health-check
option to check if your deploy is correctly configured:
Example response for Zoom:
The -d, --data-source
option of our CLI script allows you to test all the endpoints for a given data source (available data sources are listed in the script's help: -h
option). The only difference with the previous examples is that the -u, --url
option has to be the URL of the deploy without the corresponding API path of the data source:
Notice how the URL changes, and any other option the Psoxy may need doesn't.
Assuming that you've successfully deployed the Psoxy to AWS, you can inspect the logs by running the following command:
Use the following command to review the runtime logs of your Psoxy deploy to GCP:
The <projectId>
option is the Google Cloud project identifier that hosts your Psoxy deploy, and the <functionName>
option is the identifier of the Cloud Function that represents the Psoxy instance itself.
Assuming that you've successfully deployed the Psoxy "bulk instance" to AWS, you need to provide the script with a CSV example file containing some PII records, the name of the input bucket and the output one (these are expected to be S3 buckets in the same AWS region). The script also needs the AWS region (default is us-east-1
), and the ARN of the role that will be assumed to perform the upload and download operations.
Example:
Use the following command to test a Psoxy "bulk" instance deployed to GCP:
In this case, -i
and -o
options represent Google Cloud Storage buckets.
The testing script will rename the files you upload by appending a timestamp value as suffix: my-test-file.csv
will appear as my-test-file-{timestamp}.csv
in both the input and output buckets. This is done to avoid conflicts with files that may already exist in the buckets.
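The renaming behavior described above can be sketched as follows. The helper name and the timestamp format (epoch milliseconds) are assumptions for illustration; the test tool itself is written in Node.js and may format the suffix differently.

```python
import time
from pathlib import PurePosixPath


def with_timestamp_suffix(filename: str, timestamp=None) -> str:
    """Insert '-<timestamp>' between the file's stem and its extension."""
    if timestamp is None:
        timestamp = int(time.time() * 1000)  # assumed format: epoch milliseconds
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}-{timestamp}{path.suffix}"))


renamed = with_timestamp_suffix("my-test-file.csv", timestamp=1700000000000)
```

Here renamed is my-test-file-1700000000000.csv; because each upload gets a distinct suffix, repeated test runs do not overwrite earlier objects in the buckets.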
By default, the sanitized file will be deleted from the output bucket after the comparison test (original file vs. sanitized one). Run node cli-file-upload.js -h
to see all the available options (keep sanitized file in the output bucket, save it to disk, etc).
There are two approaches to upgrading your Proxy to a newer version.
In both cases, you should carefully review your next terraform plan
or terraform apply
for changes to ensure you understand what will be created, modified, or destroyed by the upgrade.
If you have doubts, review CHANGELOG.md
for highlights of significant changes in each version; and detailed release notes for each release:
https://github.com/Worklytics/psoxy/releases
upgrade-terraform-modules Script
If you originally used one of our example repos (psoxy-example-aws or psoxy-example-gcp, etc), starting from version v0.4.30, you can use the following command, leveraging a script created when you initialized the example:
This will update all the version references throughout your example, and offer you a command to revert if you later wish to do so.
Open each .tf file in the root of your configuration. Find all module references ending in a version number, and update them to the new version.
Eg, look for something like the following:
update the v0.4.37 to v0.4.46:
Then run terraform init after saving the file to download the new version of each module.
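If you have many module references to update by hand, the substitution can be sketched as below. This is an illustration only: it assumes module sources pin their version with a trailing ?ref=<version> inside a quoted string, and the sample source URL is hypothetical. Check your own .tf files before relying on it.

```python
import re


def bump_module_versions(tf_text: str, old: str, new: str) -> str:
    """Replace `old` with `new` where it ends a quoted `?ref=` module source."""
    # Assumption: versions appear as ...?ref=v0.4.37" at the end of `source`.
    pattern = re.compile(r'(\?ref=)' + re.escape(old) + r'(")')
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), tf_text)


# Hypothetical module block, for illustration only.
sample = '''module "psoxy" {
  source = "git::https://example.com/hypothetical/repo//modules/example?ref=v0.4.37"
}
'''

print(bump_module_versions(sample, "v0.4.37", "v0.4.46"))
```

After writing the updated text back to each file, terraform init is still required to download the new module versions.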
By default, the Terraform examples provided by Worklytics install a NodeJS-based tool for testing your proxy deployments.
Full documentation of the test tool is available here. And the code is located in the tools
directory of the Psoxy repository.
Wherever you run this test tool from, your AWS or GCloud CLI must be authenticated as an entity with permissions to invoke the Lambda functions / Cloud functions that you deployed for Psoxy.
If you're testing the bulk cases, the entity must be able to read/write to the cloud storage buckets created for each of those bulk examples.
If you're running the Terraform examples in a different location from where you wish to run tests, then you can install the tool alone:
Clone the Psoxy repo to your local machine:
From within that clone, install the test tool:
Get specific test commands for your deployment
If you set the todos_as_outputs
variable to true
, your Terraform apply run should contain todo2
output variable with testing instructions.
If you set todos_as_local_files
variable to true
, your Terraform apply run should contain local files named TODO 2 ...
with testing instructions.
In both cases, you will need to replace the test tool path included there with the path to your installation.
Example commands of the primary testing tool: "Psoxy Test Calls"
If you used an approach other than Terraform, or did not directly use our Terraform examples, you may not have the testing examples or the test tool installed on your machine.
In such a case, you can install the test tool manually by following steps 1+2 above, and then can review the documentation on how to use it from your machine.
This document describes how to migrate your deployment from one cloud provider to another, or one project/account to another. It does not cover migrating between proxy versions.
Use cases:
move from a dev
account to a prod
account (Account / Project Migration)
move from a "shared" account to a "dedicated" account (Account / Project Migration)
move from AWS --> GCP, and vice versa (Provider Migration)
Some data/infrastructure MUST, or at least SHOULD be preserved during your migration. Below is an enumeration of both cases.
Data, such as configuration values, can generally be copied; you just need to make a new copy of it in the new environment managed by the new Terraform configuration.
Some infrastructure, such as API Clients, will be moved; ie, the same underlying resource will continue to exist, but it will be managed by the new Terraform configuration instead of the old one. This is the more tedious case, as you must both import this infrastructure into your new configuration and then rm (remove) it from your old configuration, rather than having it be destroyed when you tear down the old configuration. You should carefully review every terraform apply, including terraform destroy commands, to ensure that infrastructure you intend to move is not destroyed or replaced (eg, Terraform sees it as tainted, and does a destroy + create within a single apply operation).
What you MUST copy:
SALT
value. This is a secret used to generate the pseudonyms. If this is lost/destroyed, you will be unable to link any data pseudonymized with the original salt to data you process in the future.
NOTE: the underlying resource to preserve is actually a random_password
resource, not an SSM parameter / GCP Secret - because those simply are being filled from the terraform random_password
resource; if you import parameter/secret, but not the random_password
, Terraform will generate a new value and overwrite the parameter/secret.
as of v0.4.35 examples, the terraform resource ID for this value is expected to be module.psoxy.module.psoxy.random_password.pseudonym_salt
; if not, you can search for it with terraform state list | grep random_password
value for PSEUDONYMIZE_APP_IDS. This value, if set to true, will have the proxy use a rule set that pseudonymizes identifiers issued by source applications themselves in some cases where these identifiers aren't inherently PII - but the association could be considered discoverable.
value for EMAIL_CANONICALIZATION. Prior to v0.4.52, the default was in effect STRICT; so if your original deployment was built on an earlier version, you should explicitly set this value to STRICT in your new configuration (likely via the email_canonicalization variable in the Terraform modules)
any custom sanitization rules that you've set, either in your Terraform configuration or directly as the value of a RULES
environment variable, SSM Parameter, or GCP Secret.
historical sanitized files for any bulk connectors, if you wish to continue to have this data analyzed by Worklytics. (eg, everything from all your -sanitized
buckets)
NOTE: you do NOT need to copy the ENCRYPTION_KEY
value; rotation of this value should be expected by clients.
What you SHOULD move:
API Clients. Whether generated by Terraform or not, the "API Client" for a data source must typically be authorized by a data source administrator to grant it access to the data source. As such, if you destroy the client, or lose its id, you'll need to coordinate with the administrator again to recreate it / obtain the configuration information.
as of v0.4.35
, Google Workspace and Microsoft 365 API clients are managed directly by Terraform, so these are important to preserve.
What you SHOULD copy:
API Client Secrets, if generated outside of Terraform. If you destroy/lose these values, you'll need to contact the data source administrator to obtain new versions.
Prior to beginning your migration, you should make a list of what existing infrastructure and/or configuration values you intend to move/copy.
The following is a rough guide on the steps you need to take to migrate your deployment.
Salt value. If using an example forked from our template repos at v0.4.35
or later, you can find the output
block in your main.tf
for pseudonym_salt
, uncomment it, run terraform apply
. You'll then be able to obtain the value with: terraform output --raw pseudonym_salt
On macOS, you can copy the value to your clipboard with: terraform output --raw pseudonym_salt | pbcopy
Microsoft 365 API client, if any:
Find the resource ids: terraform state list | grep "\.azuread_application\."
For each, obtain its objectId: terraform state show 'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector'
Prepare import command for each client for your new configuration, eg: terraform import 'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector' '<objectId>'
Google Workspace API clients, if any:
Find the resource ids: terraform state list | grep 'google_service_account\.connector-sa'
For each, obtain its unique_id
: terraform state show 'module.worklytics_connectors_google_workspace.module.google_workspace_connection["gdirectory"].google_service_account.connector-sa'
Prepare import command for each client for your new configuration, eg: terraform import 'module.worklytics_connectors_google_workspace.module.google_workspace_connection["gdirectory"].google_service_account.connector-sa' '<unique_id>'
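Since every moved resource needs both an import (run in the new configuration) and a later state rm (run in the old one), it can help to generate the two command lists together. A minimal sketch, with a hypothetical helper name and a placeholder resource ID:

```python
import shlex


def migration_commands(resources):
    """Build paired `terraform import` / `terraform state rm` command lists.

    `resources` is a list of (resource address, provider resource id) tuples.
    """
    imports, removals = [], []
    for address, resource_id in resources:
        # shlex.quote protects the brackets and double quotes in the address.
        quoted = shlex.quote(address)
        imports.append(f"terraform import {quoted} {shlex.quote(resource_id)}")
        removals.append(f"terraform state rm {quoted}")
    return imports, removals


# Placeholder ID; use the objectId / unique_id you obtained from state show.
to_move = [
    (
        'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector',
        "00000000-0000-0000-0000-000000000000",
    ),
]
imports, removals = migration_commands(to_move)
print("\n".join(imports))   # run in the NEW configuration
print("\n".join(removals))  # run in the OLD configuration, before destroy
```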
Create a new Terraform configuration from scratch; run terraform init
there (if you begin with one of our examples, our init
script does this). Use the terraform.tfvars
of your existing configuration as a guide for what variables to set, copying over any needed values.
Run a provisional terraform plan
and review.
Run the imports you prepared in Phase 1, if all appear OK, run another terraform plan
and review (comparing to the old one).
Optionally, run terraform plan -out=plan.out
to create a plan file; if you send this, along with all the *.tf
/*.tfvars
files to Worklytics, we can review it and confirm that it is correct.
Run terraform apply
to create the new infrastructure; re-confirm that the plan is not re-creating any API clients/etc that you intended to preserve
Via the AWS / GCP console, or CLIs, copy the values of any secrets/parameters that you intend to preserve, by reading them directly from your old account/project and writing them into the new account/project
Look at the TODO 3
files/output variables for all your connectors. Make a mapping between the old values and the new values. Send this to Worklytics. It should include for each the proxy URLs, AWS Role to use, and any other values that are changing.
Wait for confirmation that Worklytics has migrated all your connections to the new values. This may take 1-2 days.
Remove references to any API Clients you migrated in Phase 1:
eg, terraform state rm 'module.psoxy.module.msft-connection["azure-ad"].azuread_application.connector'
run terraform destroy
in the old configuration. Carefully review the plan before confirming.
if you're using Google Workspace sources, you may see destruction of google_project_service resources; if you allow these to be destroyed, these APIs will be disabled; if you are using the same GCP project in your other configuration, you should run terraform apply there again to re-enable them.
You may also destroy any API clients/etc that are managed outside of Terraform and which you did not migrate to the new environment.
You may clean up any configuration values, such as SSM Parameters / GCP Secrets to customize the proxy rules sets, that you may have created in your old host environment.
If you're using Terraform Cloud or Enterprise, here are a few things to keep in mind.
NOTE: this is tested only for gcp; for aws YMMV, and in particular we expect Microsoft 365 sources will not work properly, given how those are authenticated.
Prereqs:
git/java/maven, as described here https://github.com/Worklytics/psoxy#required-software-and-permissions
for testing, you'll need the CLI of your host environment (eg, AWS CLI, GCloud CLI, Azure CLI) as well as npm/NodeJS installed on your local machine
After authenticating your terraform CLI to Terraform Cloud/enterprise, you'll need to:
Create a Project in Terraform Cloud; and a workspace within the project.
Clone one of our example repos and run the ./init
script to initialize your terraform.tfvars
for Terraform Cloud. This will also put a bunch of useful tooling on your machine.
Commit the bundle that was output by the ./init script to your repo:
Change the Terraform backend in main.tf to point to your Terraform Cloud rather than local state
remove backend
block from main.tf
add a cloud
block within the terraform
block in main.tf
(obtain content from your Terraform Cloud)
run terraform init
to migrate the initial "local" state to the remote state in Terraform Cloud
You'll have to authenticate your Terraform Cloud with Google / AWS / Azure, depending on the cloud you're deploying to / data sources you're using.
If you're using Terraform Cloud or Enterprise, our convention of writing "TODOs" to the local file system might not work for you.
To address this, we've updated most of our examples to also output todo values as Terraform outputs, todos_1
, todos_2
, etc.
To get them nicely on your local machine, something like the following:
get an API token from your Terraform Cloud or Enterprise instance (eg, https://developer.hashicorp.com/terraform/cloud-docs/users-teams-organizations/api-tokens).
set it as an env variable, as well as the host:
run a curl command using those values to get each todo:
If you have terraform
CLI auth'd against your Terraform Cloud or Enterprise instance, then you might be able to avoid the curl-hackery above, and instead use the following:
(This approach should also work with Terraform CLI running with backend
, rather than cloud
)
As Terraform Cloud runs remotely, the test tool we provide for testing your deployment will not be available by default on your local machine. You can install it locally and adapt the suggestions from the todos_2
output variable of your terraform run to test your deployment from your local machine or another environment. See testing.md for details.
If you have run our init
script locally (as suggested in 'Getting Started') then the test tool should have been installed (likely at .terraform/modules/psoxy/tools/
). You will need to update everything in todos_2.md
to point to this path for those test commands to work.
If you need to directly install/re-install it, something like the following should work:
This guide provides a roadmap of a typical implementation with Worklytics-provided support.
30-60 min video call to get overview of process, responsibilities
Attendees:
Product Stakeholder(s)
Data Source Administrator(s), if identified
IT Admin(s), if identified
Agenda:
determine data sources, and who can authorize access to each
determine host platform (GCP or AWS)
identify who has the permissions to manage infra, will be able to run Terraform, and how they'll run it (where, authenticated how)
scope desired data interval, approximate headcount, etc.
identify any potential integration issues or infrastructure constraints
1-2 hr video call, to walk-through customization and initial terraform runs via screenshare
Attendees:
IT Admin(s) who will be running Terraform
Worklytics technical contact
Prior to this call, please follow the initial steps in the Getting Started
section for your host platform and ensure you have all Prereqs
Goals:
get example customized and a terraform plan working.
run terraform apply
. Obtain the TODO 1
files you can send to your data source administrators to complete, as needed.
Tips:
Works best if we screenshare
can be completed without call; but Worklytics can assist if desired
follow TODO 2
files / use test *.sh
shell scripts produced by terraform apply
validate that authentication/authorization is correct for all connections, and that you're satisfied with proxy behavior
Further guidance on proxy testing:
https://docs.worklytics.co/psoxy/guides/testing
can be completed without call; but Worklytics can assist if desired
Authorize Worklytics to invoke API connectors and access sanitized bulk data:
obtain service account ID of your tenant from Worklytics (via Worklytics web portal)
configure it in your terraform.tfvars
file (details below)
run terraform apply
again to update IAM policy to reflect the change
For AWS-hosted case, add the numeric ID of your Worklytics tenant to the list caller_gcp_service_account_ids
:
eg
For GCP-hosted case, add the email address of your Worklytics tenant to the list worklytics_sa_emails
:
eg
can be completed without call; but Worklytics can assist if desired
follow TODO 3
files (or terraform output values) generated by the terraform apply
command
if you do not have access to Worklytics, or you do, but do not have Data Connection Admin
role, send these files to the appropriate person
Done with your Psoxy deployment?
Terraform makes it easy to clean up when you're through with Psoxy, or if you wish to rebuild everything from scratch.
First, a few caveats:
this will NOT undo any changes outside of Terraform, even those we instructed you to perform via TODO -
files that Terraform may have generated.
be careful with anything you created outside of Terraform and later imported into Terraform, such as GCP project / AWS account themselves. If you DON'T want to destroy these, do terraform state rm <resource>
(analogue of the import) for each.
Do the following to destroy your Psoxy infra:
open the main.tf of your Terraform configuration; remove ALL blocks that aren't terraform or provider blocks. You'll be left with ~30 lines that look like the following.
NOTE: do not edit your terraform.tfvars file or remove any references to your AWS / Azure / GCP accounts; Terraform needs to be authenticated and know where to delete stuff from!
run terraform apply
. It'll prompt you with a plan that says "0 to create, 0 to modify" and then some huge number of things to destroy. Type 'yes' to apply it.
That's it. It should remove all the Terraform infra you created.
if you want to rebuild from scratch, revert your changes to main.tf
(git checkout main.tf
) and then terraform apply
again.
This is related to gcloud
not being authenticated (or installed?) in the environment where you're running terraform, which the google
terraform provider requires.
If you DO NOT intend to use Google Workspace as a data source, you should do the following:
remove the google-*.tf
files from your terraform configuration
remove module/local references from your main.tf
file that referred to those files; as of v0.4.53
, there are 3 such references you must remove; you will get errors in terraform commands until you remove all of them. The error messages should reference the impacted line numbers.
If you DO intend to use Google Workspace as a data source, you must install and authenticate the gcloud
CLI and/or modify the google
provider block in google-workspace.tf
with your desired authentication details. See: Google Terraform Provider
Our example templates include a script to check for the prerequisites for running the psoxy. You can run this prior to ./init
to get feedback/suggestions on what prerequisites you may be missing and how to install them.
Our example Terraform configurations should compile and package the Java code into a JAR file, which is then deployed by Terraform to your host environment.
This is done via a build script, invoked by a Terraform module (see modules/psoxy-package
).
If, on your first terraform plan
/terraform apply
, you see the line such as
module.psoxy-aws-msft-365.module.psoxy-aws.module.psoxy-package.data.external.deployment_package: Reading...
And that returns really quickly, something may have gone wrong with the build. You can trigger the build directly by running:
That may give you some clues as to what went wrong.
You can also look for a file called last-build.log
in the directory where your Terraform configuration resides.
If you want to go step-by-step, you can run the following commands:
Some problems we've seen:
Maven repository access - the build process must get various dependencies from a remote Maven repository; if your laptop cannot reach Maven Central, is configured to get dependencies from some other Maven repository, etc - you might need to fix this issue. You can check your ~/.m2/settings.xml file, which might give you some insight into what Maven repository you're using. It's also where you'd configure credentials for a private Maven repository, such as Artifactory/etc - so make sure those are correct.
If you upgrade your psoxy code, it may be worth trying terraform init --upgrade
to make sure you have the latest versions of all Terraform providers on which our configuration depends.
By default, terraform locks providers to the version that was the latest when you first ran terraform init
. It does not upgrade them unless you explicitly instruct it to. It will not prompt you to upgrade them unless we update the version constraints in our modules.
While we strive to ensure accurate version constraints, and use provider features consistent with these constraints, our automated tests will run with the latest version of each provider. Regrettably, we don't currently have a way to test with ALL versions of each provider that satisfy the constraints, or all possible combinations of provider versions.
Often, in response to errors, a second run of terraform apply
will work.
If something was actually created in the cloud provider, but Terraform state doesn't reflect it, then try terraform import [resource] [provider-resource-id]
. [resource]
should be replaced with whatever the path to it is in your terraform configuration, which you can get from the terraform plan
output. provider-resource-id
is a little trickier, and you might need to find the format required by finding the Terraform docs for the resource type on the web.
NOTE: resources in plan with brackets/quotes will need these escaped with a backslash for use in bash commands.
eg
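The backslash escaping described in the note can be sketched as follows. The helper name and the sample resource address are hypothetical; wrapping the whole address in single quotes, as done elsewhere in these guides, is an equivalent alternative.

```python
import re


def backslash_escape(address: str) -> str:
    """Prefix brackets and double quotes with a backslash for use in bash."""
    return re.sub(r'([\[\]"])', r'\\\1', address)


# Hypothetical resource address, as it might appear in `terraform plan` output.
address = 'module.example["azure-ad"].azuread_application.connector'
print("terraform import " + backslash_escape(address) + " <provider-resource-id>")
```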
Errors such as the following on terraform plan
?
The solution is to downgrade your Terraform version to one that's supported by our modules (>= 1.3.x, <= 1.7.x as of March 2024).
If you're running Terraform in cloud/CI environment, including Terraform Cloud, GitHub Actions, etc, you can likely explicitly set the desired Terraform version in your workspace settings / terraform setup action.
If you're running Terraform on your laptop or in a VM, use your package manager to downgrade or something like tfenv
to concurrently use distinct Terraform versions on the machine. (set version <= 1.7.x in .terraform-version
file in the root of your Terraform configuration for the proxy).
Psoxy can be used to sanitize bulk files (eg, CSV, NDJSON, etc), writing the result to another bucket.
You can automate a data pipeline to push files to an -input
bucket, which will trigger a Psoxy instance (GCP Cloud Function or AWS Lambda), which will read the file, sanitize it, and write the result to a corresponding -sanitized
bucket.
You should limit the size of files processed by proxy to 200k rows or less, to ensure processing of any single file finishes within the run time limitations of the host platform (AWS, GCP). There is some flexibility here based on the complexity of your rules and file schema, but we've found 200k to be a conservative target.
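If your export produces larger files, one way to stay under the row limit is to split them before upload, repeating the header row in each part. A minimal sketch (function names are hypothetical, and the whole file is held in memory for simplicity):

```python
import csv
import io
from typing import Iterator, List


def _render(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a header plus rows back to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


def split_csv(text: str, max_rows: int = 200_000) -> Iterator[str]:
    """Yield CSV chunks of at most `max_rows` data rows, each with the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    chunk: List[List[str]] = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == max_rows:
            yield _render(header, chunk)
            chunk = []
    if chunk:
        yield _render(header, chunk)


parts = list(split_csv("id,email\n1,a@acme.com\n2,b@acme.com\n3,c@acme.com\n", max_rows=2))
print(len(parts))  # prints 2
```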
To improve performance and reduce storage costs, you should compress (gzip) the files you write to the -input
bucket. Psoxy will decompress gzip files before processing and then compress the result before writing to the -sanitized
bucket. Ensure that you set Content-Encoding: gzip
on all files in your -input
bucket to enable this behavior. Note that if you are uploading files via the web UI in GCP/AWS, it is not possible to set this metadata in the initial upload - so you cannot use compression in such a scenario.
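The compression step itself needs only the standard library, as in the sketch below. The upload is deliberately not shown: whichever client or CLI you use must also set the Content-Encoding: gzip metadata on the object, and how to do that depends on your tooling.

```python
import gzip


def gzip_bytes(raw: bytes) -> bytes:
    """Compress file content before uploading it to the -input bucket."""
    return gzip.compress(raw)


original = b"employee_id,email\n1,alice@acme.com\n"
compressed = gzip_bytes(original)

# Sanity check before upload: the content must round-trip unchanged.
assert gzip.decompress(compressed) == original
```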
The 'bulk' mode of Psoxy supports either column-oriented or record-oriented file formats.
To cater to column-oriented file formats (eg, .csv, .tsv), Psoxy supports a ColumnarRules format for encoding your sanitization rules. This rules format provides simple/concise configuration for these cases, where more complex processing of repeated values / complex field types is not required.
If your use-case is record oriented (eg, NDJSON
, etc), with nested or repeated fields, then you will likely need RecordRules
as an alternative.
The core function of the Proxy is to pseudonymize PII in your data. To pseudonymize a column, add it to columnsToPseudonymize
.
To avoid inadvertent data leakage, if a column specified to be pseudonymized is not present in the input data, the Proxy will fail with an error. This is to avoid simple column name typos resulting in data leakage.
To ease integration, the 'bulk' mode also supports a few additional common transformations that may be useful. These provide an alternative to using a separate ETL tool to transform your data, or modifying your existing data export pipelines.
Redaction
To redact a column, add it to columnsToRedact
. By default, all columns present in the input data will be included in the output data, unless explicitly redacted.
Inclusion
Alternatively to redacting columns, you can specify columnsToInclude
. If specified, only columns explicitly included will be included in the output data.
Renaming Columns
To rename a column, add it to columnsToRename
, which is a map from original name --> desired name. Renames are applied before pseudonymization.
This feature supports simple adaptation of existing data pipelines for use in Worklytics.
Rule structure is specified in ColumnarRules
.
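The interplay of these options can be sketched as below. This illustrates the semantics described above and is not the Proxy's implementation: the SHA-256 call is a placeholder for real pseudonymization, the assumption that rules refer to column names after renaming follows from renames being applied first, and the sample column names are hypothetical.

```python
import hashlib


def sanitize_row(row: dict, rules: dict, salt: str) -> dict:
    """Apply ColumnarRules-style options to one row (a column -> value dict)."""
    # Renames are applied first, so later rules refer to the new names.
    renames = rules.get("columnsToRename", {})
    row = {renames.get(name, name): value for name, value in row.items()}

    pseudonymize = rules.get("columnsToPseudonymize", [])
    for column in pseudonymize:
        # A missing column is an error, so a typo cannot silently leak PII.
        if column not in row:
            raise ValueError(f"column to pseudonymize not found: {column}")

    redact = set(rules.get("columnsToRedact", []))
    include = rules.get("columnsToInclude")  # None means "include everything"

    sanitized = {}
    for name, value in row.items():
        if name in redact or (include is not None and name not in include):
            continue
        if name in pseudonymize:
            # Placeholder hash, NOT the Proxy's actual pseudonym format.
            value = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
        sanitized[name] = value
    return sanitized


rules = {
    "columnsToRename": {"EMPLOYEE_EMAIL": "email"},
    "columnsToPseudonymize": ["email"],
    "columnsToRedact": ["NAME"],
}
row = {"EMPLOYEE_EMAIL": "alice@acme.com", "NAME": "Alice", "DEPT": "Eng"}
print(sorted(sanitize_row(row, rules, salt="example-salt")))
# prints ['DEPT', 'email']
```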
As of Oct 2023, this is a beta feature
RecordRules
parses files as records, presuming the specified format. It performs transforms in order on each record to sanitize your data, and serializes the result back to the specified format.
eg.
Each transform
is a map from transform type --> to a JSONPath to which the transform should be applied. The JSONPath is evaluated from the root of each record in the file.
The above example rules apply two transforms. First, they redact $.summary - the summary field at the root of the record object. Second, they pseudonymize $.email - the email field at the root of the record object.
transforms itself is an ordered list of transforms, which are applied in order.
CSV format is also supported, but in effect is converted to a simple JSON object before rules are applied; so JSON paths in transforms should all be single-level; eg, $.email
to refer to the email
column in the CSV.
Rule structure is specified in RecordRules
.
As of Oct 2023, this feature is in beta and may change in backwards incompatible ways
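The ordered application of transforms to each record can be sketched as follows, for NDJSON input and single-level paths only. The transform keys redact and pseudonymize are assumed from the description above, the hash is a placeholder rather than the Proxy's pseudonym format, and real JSONPath evaluation is far more general than this.

```python
import hashlib
import json


def apply_transforms(record: dict, transforms: list, salt: str) -> dict:
    """Apply each transform, in order, to a single parsed record."""
    record = dict(record)
    for transform in transforms:
        # Each transform maps a transform type to a JSONPath.
        for kind, path in transform.items():
            field = path[2:] if path.startswith("$.") else ""
            if not field or "." in field or "[" in field:
                raise ValueError(f"only single-level paths are sketched: {path}")
            if field not in record:
                continue
            if kind == "redact":
                del record[field]
            elif kind == "pseudonymize":
                # Placeholder hash, NOT the Proxy's actual pseudonym format.
                raw = (salt + str(record[field])).encode("utf-8")
                record[field] = hashlib.sha256(raw).hexdigest()
            else:
                raise ValueError(f"unsupported transform in this sketch: {kind}")
    return record


def sanitize_ndjson(text: str, transforms: list, salt: str) -> str:
    """Parse each line as a record, sanitize it, and serialize it back."""
    lines = [
        json.dumps(apply_transforms(json.loads(line), transforms, salt))
        for line in text.splitlines()
        if line.strip()
    ]
    return "\n".join(lines) + "\n"


transforms = [{"redact": "$.summary"}, {"pseudonymize": "$.email"}]
data = '{"summary": "1:1 with Bob", "email": "alice@acme.com", "team": "eng"}\n'
print(sanitize_ndjson(data, transforms, salt="example-salt"))
```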
You can process multiple file formats through a single proxy instance using MultiTypeBulkDataRules
.
These rules are structured with a field fileRules
, which is a map from parameterized path template within the "input" bucket to one of the above rule types (RecordRules
,ColumnarRules
) to be applied to files matching that path template.
Path templates are evaluated against the incoming file (object) path in order, and the first match is applied to the file. If no templates match the incoming file, it will not be processed.
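First-match selection over path templates can be sketched as below. The template syntax shown ({name} placeholders that match a single path segment), the leading slash, and the example paths are all assumptions for illustration; consult MultiTypeBulkDataRules for the actual syntax.

```python
import re
from typing import Optional


def template_to_regex(template: str) -> str:
    """Turn '/a/{x}/b' into a regex where each {x} matches one path segment."""
    parts = re.split(r"(\{[^{}/]+\})", template)
    return "".join(
        "[^/]+" if part.startswith("{") else re.escape(part) for part in parts
    )


def select_rules(object_path: str, file_rules: dict) -> Optional[str]:
    """Return the rules of the first template matching `object_path`, if any."""
    for template, rules in file_rules.items():  # dicts keep insertion order
        if re.fullmatch(template_to_regex(template), object_path):
            return rules
    return None  # no match: the file would not be processed


# Hypothetical templates; the values stand in for RecordRules / ColumnarRules.
file_rules = {
    "/export/{week}/data_{shard}.ndjson": "record-rules",
    "/export/{week}/{file}.csv": "columnar-rules",
}
print(select_rules("/export/2024-01/data_3.ndjson", file_rules))  # prints record-rules
print(select_rules("/other/file.txt", file_rules))  # prints None
```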
Worklytics' provided Terraform modules include default rules for expected formats for hris
, survey
, and badge
data connectors.
If your input data does not match the expected formats, you can customize the rules in one of the following ways.
NOTE: The configuration approaches described below utilize Terraform variables as provided by our gcp and aws template examples. Other examples may not support these variables; please consult the variables.tf at the root of your configuration. If you are directly using Worklytics' Terraform modules, you can consult the variables.tf in the module directory to see if these variables are exposed.
You can override the rules used by the predefined bulk connectors (eg hris
, survey
, badge
) by filling the custom_bulk_connector_rules
variable in your Terraform configuration.
This variable is a map from connector ID --> rules, with the rules encoded in HCL format (rather than YAML as shown above). An illustrative example:
This approach ONLY supports ColumnarRules
Rather than enabling one of the predefined bulk connectors provided in the worklytics-connector-specs Terraform module, you can specify a custom connector from scratch, including your own rules.
This approach is less convenient than the previous one, as TODO documentation and deep-links for connecting your data to Worklytics will not be generated.
To create a Custom Bulk Connector, use the custom_bulk_connectors
variable in your Terraform configuration, for example:
The above example is for ColumnarRules
.
You can directly modify the RULES
environment variable on the Psoxy instance, by directly editing your instance's environment via your hosting provider's console or CLI. In this case, the rules should be encoded in YAML format, such as:
Alternatively, you can remove the environment variable from your instance, and instead configure a RULES
value in the "namespace" of your instance, in the AWS Parameter Store or GCP Secret Manager (as appropriate for your hosting provider).
This approach is useful for testing, but note that if you later run terraform apply
again, any changes you make to the environment variable may be overwritten by Terraform.
If you encounter issues processing your files, check the logs of the Psoxy instance. The logs will give some indication of what went wrong, and may help you identify the issue.
Causes: The column specified in columnsToPseudonymize
is not present in the input data or contains empty values. Any column specified in columnsToPseudonymize
must be present in the input data.
Solution: Regenerate your input file removing empty values for mandatory columns.
Causes: The file size is too large for the Psoxy instance to process, likely in AWS Lambda in proxy versions prior to v0.4.54.
Solutions:
Use compression in the file (see Compression); if already compressed, then:
Split the file into smaller files and process them separately
(AWS only) Update the proxy version to v0.4.55 or later
(AWS only) If in v0.4.55 or later, process the files one by one or increase the ephemeral storage allocated to the Lambda function (see https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/)
Psoxy supports specifying sanitization rule sets to use to sanitize data from an API. These can be configured by encoding a rule set in YAML and setting a parameter in your instance's configuration. See an example of rules for Zoom: zoom.yaml
.
If such a parameter is not set, a proxy instance selects default rules based on source kind, from the corresponding supported source.
You can configure custom rule sets for a given instance via Terraform, by adding an entry to the custom_api_connector_rules
map in your terraform.tfvars
file.
eg,
<ruleset> ::= "endpoints:" <endpoint-list>
<endpoint-list> ::= <endpoint> | <endpoint> <endpoint-list>
A ruleset is a list of API endpoints that are permitted to be invoked through the proxy. Requests which do not match an endpoint in this list will be rejected with a 403 response.
<endpoint> ::= <path-template> <allowed-methods> <path-parameter-schemas> <query-parameter-schemas> <response-schema> <transforms>
<path-template> ::= "- pathTemplate: " <string>
Each endpoint is specified by a path template, based on OpenAPI Spec v3.0.0 Path Template syntax. Variable path segments are enclosed in curly braces ({}
) and are matched by any value that does not contain an /
character.
See: https://swagger.io/docs/specification/paths-and-operations/
<allowed-methods> ::= "- allowedMethods: " <method-list>
<method-list> ::= <method> | <method> <method-list>
<method> ::= "GET" | "POST" | "PUT" | "PATCH" | "DELETE" | "HEAD"
If provided, only HTTP methods included in this list will be permitted for the endpoint. Given semantics of RESTful APIs, this allows an additional point to enforce "read-only" data access, in addition to OAuth scopes/etc.
NOTE: for AWS-hosted deployments using API Gateway, IAM policies and routes may also be used to restrict HTTP methods. See aws/guides/api-gateway.md for more details.
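The combined effect of pathTemplate and allowedMethods can be sketched as an allow-list check. This illustrates the semantics described above, not the proxy's implementation; in particular, treating an endpoint with a disallowed method as simply not matching is an assumption, and the endpoint shown is hypothetical.

```python
import re


def path_matches(template: str, path: str) -> bool:
    """Match a path against a template; {vars} match anything except '/'."""
    parts = re.split(r"(\{[^{}/]+\})", template)
    pattern = "".join(
        "[^/]+" if part.startswith("{") else re.escape(part) for part in parts
    )
    return re.fullmatch(pattern, path) is not None


def is_permitted(method: str, url_path: str, endpoints: list) -> bool:
    """True if some endpoint allows the request; False maps to a 403."""
    path = url_path.split("?", 1)[0]  # query parameters are not part of the path
    for endpoint in endpoints:
        allowed = endpoint.get("allowedMethods")  # absent means any method
        if path_matches(endpoint["pathTemplate"], path) and (
            allowed is None or method.upper() in allowed
        ):
            return True
    return False


# Hypothetical rule set with a single read-only endpoint.
endpoints = [{"pathTemplate": "/v2/users/{userId}/meetings", "allowedMethods": ["GET"]}]
print(is_permitted("GET", "/v2/users/abc123/meetings?page=2", endpoints))  # prints True
print(is_permitted("POST", "/v2/users/abc123/meetings", endpoints))  # prints False
```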
<path-parameter-schemas> ::= "- pathParameterSchemas: " <parameter-schema>
<query-parameter-schemas> ::= "- queryParameterSchemas: " <parameter-schema>
alpha - a parameter schema to use to validate path/query parameter values; if validation fails, proxy will return 403 forbidden response. Given the use-case of validating URL / query parameters, only a small subset of JSON Schema is supported.
As of 0.4.38, this is considered an alpha feature which may change in backwards-incompatible ways.
Currently, the supported JSON Schema features are:
type
:
string
value must be a JSON string
integer
value must be a JSON integer
number
value must be a JSON number
format
:
reversible-pseudonym
: value MUST be a reversible pseudonym generated by the proxy
pattern
: a regex pattern to match against the value
enum
: a list of values to match against the value
null
/empty is valid for all types; you can use a pattern to restrict this further.
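A validator for this subset might look like the sketch below. Several details are assumptions rather than documented behavior: parameter values arrive as strings, pattern must match the whole value, pattern is still applied to empty values, and enum is skipped for empty values. The reversible-pseudonym format is omitted because it can only be checked by the proxy itself.

```python
import re
from typing import Optional

JSON_INTEGER = r"-?(0|[1-9]\d*)"
JSON_NUMBER = JSON_INTEGER + r"(\.\d+)?([eE][+-]?\d+)?"


def is_valid(value: Optional[str], schema: dict) -> bool:
    """Validate a path/query parameter value; False maps to a 403."""
    text = "" if value is None else value
    if text != "":  # null/empty passes the type and enum checks
        kind = schema.get("type")
        if kind == "integer" and not re.fullmatch(JSON_INTEGER, text):
            return False
        if kind == "number" and not re.fullmatch(JSON_NUMBER, text):
            return False
        if "enum" in schema and text not in schema["enum"]:
            return False
    # Assumption: a pattern applies even to empty values, so it can forbid them.
    if "pattern" in schema and not re.fullmatch(schema["pattern"], text):
        return False
    return True


print(is_valid("42", {"type": "integer"}))  # prints True
print(is_valid("4.2", {"type": "integer"}))  # prints False
print(is_valid("", {"type": "string", "pattern": ".+"}))  # prints False
```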
<response-schema> ::= "responseSchema: " <json-schema-filter>
See: Response Schema Specification below.
<transforms> ::= "transforms:" <transform-list>
<transform-list> ::= <transform> | <transform> <transform-list>
For each Endpoint, rules specify a list of transforms to apply to the response content.
<transform> ::= "- " <transform-type> <json-paths> [<encoding>]
Each transform is specified by a transform type and a list of JSON paths. The transform is applied to all portions of the response content that match any of the JSON paths.
Supported Transform Types:
<transform-type> ::= "!<pseudonymizeEmailHeader>" | "!<pseudonymize>" | "!<redact>" | "!<redactRegexMatches>" | "!<tokenize>" | "!<filterTokenByRegex>" | "!<redactExceptSubstringsMatchingRegexes>"
NOTE: these are implementations of com.avaulta.gateway.rules.transforms.Transform
class in the psoxy codebase.
!<pseudonymize>
- transforms matching values by normalizing them (trimming whitespace; if they appear to be emails, treating them as case-insensitive, etc) and computing a SHA-256 hash of the normalized value. Relies on the SALT value configured in your proxy environment to ensure the SHA-256 is deterministic across time and between sources. In the case of emails, the domain portion is preserved, although the hash is still based on the entire normalized value (avoids hash of alice@acme.com matching hash of alice@beta.com).
Options:
includeReversible
(default: false
): If true
, an encrypted form of the original value will be included in the result. This value, if passed back to the proxy in a URL, will be decrypted back to the original value before the request is forwarded to the data source. This is useful for identifying values that are needed as parameters for subsequent API requests. This relies on symmetric encryption using the ENCRYPTION_KEY
secret stored in the proxy; if ENCRYPTION_KEY
is rotated, any 'reversible' value previously generated will no longer be able to be decrypted by the proxy.
encoding
(default: JSON
): The encoding to use when serializing the pseudonym to a string.
JSON
- a JSON object structure, with explicit fields
URL_SAFE_TOKEN
- a string format that aims to be concise, URL-safe, and format-preserving for email case.
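The normalize-then-hash behaviour of the pseudonymize transform can be sketched in Python as follows. This is only an illustration under stated assumptions: how the proxy actually combines the salt with the value, and the field names `hash` and `domain`, are hypothetical and not taken from the psoxy codebase.

```python
import base64
import hashlib


def normalize(value: str) -> str:
    """Trim whitespace; treat values that look like emails as case-insensitive."""
    trimmed = value.strip()
    return trimmed.lower() if "@" in trimmed else trimmed


def pseudonymize(value: str, salt: str) -> dict:
    """Return a pseudonym for the value as a JSON-style object."""
    normalized = normalize(value)
    # assumption: the salt is simply appended to the value before hashing
    digest = hashlib.sha256((normalized + salt).encode("utf-8")).digest()
    result = {"hash": base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")}
    if "@" in normalized:
        # the domain is preserved, but the hash still covers the whole address
        result["domain"] = normalized.rsplit("@", 1)[1]
    return result
```

Because the same salt and normalization are applied everywhere, ` Alice@Acme.com ` and `alice@acme.com` map to the same pseudonym, while `alice@beta.com` maps to a different one.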
!<pseudonymizeEmailHeader>
- transforms matching values by parsing the value as an email header, in accordance with RFC 2822 and some typical conventions, and generating a pseudonym based only on the normalized email address itself (ignoring the name, etc that may appear). In particular:
deals with CSV lists (multiple emails in a single header)
handles the name <email>
format, in effect redacting the name and replacing with a pseudonym based only on normalized email
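A rough Python sketch of the header handling described above, using the standard library's address parser. The hashing step is a stand-in (salt appended, hex digest) and the function name is hypothetical; the proxy's actual pseudonym format may differ.

```python
import hashlib
from email.utils import getaddresses


def pseudonymize_email_header(header_value: str, salt: str) -> list[str]:
    """Return one pseudonym per address in the header; display names are dropped."""
    pseudonyms = []
    # getaddresses handles comma-separated lists and the 'name <email>' format
    for _display_name, address in getaddresses([header_value]):
        if not address:
            continue
        normalized = address.strip().lower()
        # assumption: salt appended before hashing, hex-encoded digest
        digest = hashlib.sha256((normalized + salt).encode("utf-8")).hexdigest()
        pseudonyms.append(digest)
    return pseudonyms
```

With this approach, `Alice Smith <Alice@acme.com>` and `alice@acme.com` produce the same pseudonym, since only the normalized address is hashed.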
!<redact>
- removes the matching values from the response.
Some extensions of redaction are also supported:
!<redactExceptSubstringsMatchingRegexes>
- removes the matching values from the response, except where the value matches one of the specified regex
options. (Use case: preserving portions of event titles if match variants of 'Focus Time', 'No Meetings', etc)
!<redactRegexMatches>
- redact content IF it matches one of the regex
s included as an option.
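The two redaction variants can be illustrated with a short Python sketch. Both function names are hypothetical, and the exact semantics are assumptions: here `redact_regex_matches` drops the whole value when any regex matches, and `redact_except_substrings` keeps only the first matching substring.

```python
import re
from typing import Optional


def redact_regex_matches(value: str, regexes: list[str]) -> Optional[str]:
    """Redact the value (return None) if it matches any regex; otherwise keep it."""
    if any(re.search(rx, value) for rx in regexes):
        return None
    return value


def redact_except_substrings(value: str, regexes: list[str]) -> str:
    """Keep only the first substring matching one of the regexes; else return empty."""
    for rx in regexes:
        match = re.search(rx, value)
        if match:
            return match.group(0)
    return ""
```

For the event-title use case, a title such as `Focus Time: write Q3 plan` with the regex `(?i)focus time` would be reduced to `Focus Time`, while an unrelated title would be emptied.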
By using a negation in the JSON Path for the transformation, !<redact>
can be used to implement default-deny style rules, where all fields are redacted except those explicitly listed in the JSON Path expression. This can also redact object-valued fields, conditionally based on object properties as shown below.
Eg, the following redacts all headers that have a name value other than those explicitly listed below:
!<tokenize>
- replaces matching values with a reversible token, which the proxy can reverse to the original value using the ENCRYPTION_KEY
secret stored in the proxy in subsequent requests.
The use case is values that may be sensitive, but are opaque. For example, page tokens in Microsoft Graph API do not have a defined structure, but in practice contain PII.
Options:
regex
a capturing regex to use to extract portion of value that needs to be tokenized.
!<filterTokenByRegex>
- tokenizes matching string values by a delimiter, if provided; and matches result against a list of filters
, removing any content that doesn't match at least one of the filters. (Use case: preserving Zoom URLs in meeting descriptions, while removing the rest of the description)
Options:
delimiter
- used to split the value into tokens; if not provided, the entire value is treated as a single token.
filters
- in effect, combined via OR; tokens matching ANY of the filters are preserved in the value.
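The split-and-filter behaviour can be sketched in Python as below. This is illustrative only: it assumes the delimiter is itself a regex, that a filter must match a whole token, and that preserved tokens are re-joined with a single space; the function name is hypothetical.

```python
import re
from typing import Optional


def filter_tokens_by_regex(
    value: str, filters: list[str], delimiter: Optional[str] = None
) -> str:
    """Keep only tokens matching at least one filter; drop everything else."""
    # without a delimiter, the entire value is treated as a single token
    tokens = re.split(delimiter, value) if delimiter else [value]
    # filters are combined via OR: a token survives if ANY filter matches it
    kept = [t for t in tokens if any(re.fullmatch(f, t) for f in filters)]
    # assumption: surviving tokens are re-joined with a single space
    return " ".join(kept)
```

For a description like `Join at https://acme.zoom.us/j/123456 to discuss salaries`, splitting on whitespace and filtering for Zoom URLs would leave only the URL.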
A "response schema" is a "JSON Schema Filter" structure, specifying how response (which must be JSON) should be filtered. Using this, you can implement a "default deny" approach to sanitizing API fields in a manner that may be more convenient than using JSON paths with conditional negations (a redact transform with a JSON path that matches all but an explicit list of named fields is the other approach to implementing 'default deny' style rules).
Our "JSON Schema Filter" implementation attempts to align to the JSON Schema specification, with some variation as it is intended for filtering rather than validation. But generally speaking, you should be able to copy the JSON Schema for an API endpoint from its OpenAPI specification as a starting point for the responseSchema
value in your rule set. Similarly, there are tools that can generate JSON Schema from example JSON content, as well as from data models in various languages, that may be useful.
See: https://json-schema.org/implementations.html#schema-generators
If a responseSchema
attribute is specified for an endpoint
, the response content will be filtered (rather than validated) against that schema. Eg, fields NOT specified in the schema, or not of expected type, will be removed from the response.
type
- one of :
object
a JSON object
array
a JSON array
string
a JSON string
number
a JSON number, either integer or decimal.
integer
a JSON integer (not a decimal)
boolean
a JSON boolean
properties
- for type == object
, a map of field names to schema to filter field's value against (eg, another JsonSchemaFilter
for the field itself)
items
- for type == array
, a schema to filter each item in the array against (again, a JsonSchemaFilter
)
format
- for type == string
, a format to expect for the string value. As of v0.4.38, this is not enforced by the proxy.
$ref
- a reference to a schema specified in the definitions
property of the root schema.
definitions
- a map of schema names to schemas of type JsonSchemaFilter
; only supported at root schema of endpoint.
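To make the "filter rather than validate" idea concrete, here is a minimal Python sketch of a recursive filter that supports `type`, `properties`, `items`, `$ref` and `definitions`. It is not the proxy's `JsonSchemaFilter` code; the function names are hypothetical, and it assumes references take the form `#/definitions/<name>` and that `null` values are dropped.

```python
from typing import Any

_REMOVED = object()  # sentinel: value did not conform and should be dropped


def _matches_type(value: Any, expected: str) -> bool:
    if expected == "object":
        return isinstance(value, dict)
    if expected == "array":
        return isinstance(value, list)
    if expected == "string":
        return isinstance(value, str)
    if expected == "boolean":
        return isinstance(value, bool)
    # bool is a subclass of int in Python, so exclude it explicitly
    if expected == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if expected == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False


def _filter(value: Any, schema: dict, root: dict) -> Any:
    if "$ref" in schema:
        # assumption: references look like "#/definitions/<name>"
        name = schema["$ref"].rsplit("/", 1)[-1]
        schema = root["definitions"][name]
    if not _matches_type(value, schema.get("type", "")):
        return _REMOVED
    if isinstance(value, dict):
        filtered = {}
        # only fields named in the schema survive (default deny)
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                kept = _filter(value[key], sub_schema, root)
                if kept is not _REMOVED:
                    filtered[key] = kept
        return filtered
    if isinstance(value, list):
        item_schema = schema.get("items")
        if item_schema is None:
            return []
        kept_items = [_filter(item, item_schema, root) for item in value]
        return [item for item in kept_items if item is not _REMOVED]
    return value


def filter_response(content: Any, schema: dict) -> Any:
    """Filter parsed JSON content against the schema; None if nothing conforms."""
    result = _filter(content, schema, schema)
    return None if result is _REMOVED else result
```

Fields absent from the schema, or present with an unexpected type, are removed from the output rather than causing an error.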
Example:
The following is for a User from the GitHub API, via graphql. See: https://docs.github.com/en/graphql/guides/forming-calls-with-graphql#the-graphql-endpoint
By default, proxy from version 0.4.61 will connect to data source APIs using TLS 1.3.
Prior to 0.4.61, the proxy should have negotiated to use 1.3 with all sources that supported it; but may have fallen back to 1.2 for some sources.
It will no longer fall back; but you can configure the proxy to use TLS 1.2 for a given source by setting the TLS_VERSION
environment variable on a proxy instance to TLSv1.2
. As TLS 1.3 offers security and performance improvements, we recommend using it whenever possible.
As of Sept 2024, we've confirmed that the following public APIs of various data sources support TLS 1.3, either through end-to-end proxy testing OR via openssl negotiation (see next section):
Google Workspace
Microsoft 365 (Microsoft Graph)
GitHub (cloud version)
Asana
Atlassian (JIRA, etc)
Slack
Zoom
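One way to check whether a host negotiates TLS 1.3, sketched with Python's standard `ssl` module. The function names are hypothetical, and this is an independent check rather than the proxy's own connection logic; it assumes a Python build linked against an OpenSSL version that supports TLS 1.3.

```python
import socket
import ssl
from typing import Optional


def tls13_only_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.3."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


def negotiated_tls_version(host: str, port: int = 443, timeout: float = 10.0) -> Optional[str]:
    """Connect to host and return the negotiated protocol version string."""
    context = tls13_only_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # raises ssl.SSLError if the server cannot negotiate TLS 1.3
        with context.wrap_socket(sock, server_hostname=host) as tls_sock:
            return tls_sock.version()
```

Calling `negotiated_tls_version` with the API hostname of a data source would be expected to return the negotiated version on success and raise `ssl.SSLError` if the source only supports older TLS versions.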
To test TLS 1.3 support, you can use something like the following command (assuming you have openssl
installed on a Mac):
Given nature of use-case, proxy does A LOT of network transit:
client to proxy (request)
proxy to data source (request)
data source to proxy (response)
proxy to client (response)
So this drives cost in several ways:
larger network payloads increase proxy running time, which is billable
network volume itself is billable in some host platforms
indirectly, clients are waiting for proxy to respond, so that's an indirect cost (paid on client-side)
Generally, proxy is transferring JSON data, which is highly compressible. Using gzip
is likely to reduce network volume by 50-80%, so we want to make sure we do this everywhere.
As of Aug 2023, we're not bothering with compressing requests, as they are expected to be small (eg, current proxy use-cases don't involve large PUT
/ POST
operations).
Compression must be managed at the application layer (eg, in our proxy code).
This is done in co.worklytics.psoxy.Handler
, which uses ResponseCompressionHandler
to detect request for compressed response, and then compress the response.
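The general idea of application-layer response compression can be sketched in Python as follows. This is not the Java handler referenced above; the function name is hypothetical and the header parsing is deliberately simplistic (it ignores quality values such as `gzip;q=0`).

```python
import gzip


def maybe_compress(
    body: bytes, request_headers: dict[str, str]
) -> tuple[bytes, dict[str, str]]:
    """Gzip the response body if the client asked for it via Accept-Encoding."""
    accept = ""
    for name, value in request_headers.items():
        # header names are case-insensitive
        if name.lower() == "accept-encoding":
            accept = value.lower()
    if "gzip" in accept:
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    # client did not request compression: return the body unchanged
    return body, {}
```

The second element of the returned tuple holds the extra response headers to send, so the client knows whether it needs to decompress the body.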
API Gateway is no longer used by our default terraform examples. But compression can be enabled at the gateway level (rather than relying on function url implementation, or in addition to).
https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html
GCP Cloud Functions will handle compression themselves IF the request meets various conditions.
There is no explicit, Cloud Function-specific documentation about this, but it seems that the behavior for App Engine applies:
https://cloud.google.com/appengine/docs/legacy/standard/go111/how-requests-are-handled#:~:text=For%20responses%20that%20are%20returned,HTML%2C%20CSS%2C%20or%20JavaScript.
All requests should be built using GzipedContentHttpRequestInitializer
, which should add:
Accept-Encoding: gzip
append (gzip)
to User-Agent
header
We believe this will trigger compression for most sources (the User-Agent thing being practice that Google seems to want).
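The header changes described above amount to something like the following Python sketch. The function name and the default User-Agent value are hypothetical; the real initializer is part of the Java codebase.

```python
def with_gzip_headers(
    headers: dict[str, str], default_user_agent: str = "psoxy-sketch"
) -> dict[str, str]:
    """Return a copy of headers that requests a gzip-compressed response."""
    updated = dict(headers)
    updated["Accept-Encoding"] = "gzip"
    # default_user_agent is a made-up placeholder for requests lacking a User-Agent
    user_agent = updated.get("User-Agent", default_user_agent)
    # append the marker only once, so the function is safe to apply repeatedly
    if "(gzip)" not in user_agent:
        user_agent = f"{user_agent} (gzip)"
    updated["User-Agent"] = user_agent
    return updated
```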
The problem we're trying to solve is that various features, such as VPCs/etc, are relevant to a small set of users. It would complicate the usual cases to enable them for all cases. So we need to provide easy support for extending the examples/modules to support them in the extreme cases.
Composition is the canonical terraform approach.
Two approaches:
composition, which is canonical terraform
a. commented out
validation
instructions to explain to customers are more complex
b. conditional: validation will work, but hacky 0 indexes around in places
conditionals + variables
pros:
simplest for customers
easiest to read/follow
cons:
verbose interfaces
brittle stacks (changing variable requires changing many in hierarchy)
For core
, it seems you need to use an explicit processor path (which IntelliJ fills for you), rather than the classpath, and output generated code to Module Content Root, not the Module Output directory.
If you want to make private (non-public) customization to Psoxy's source/terraform modules, you may wish to create a private fork of the repo. (if you intend to commit your changes, a public fork in GitHub should suffice)
See , for guidance.
Specific commands for Psoxy repo are below:
This directory documents some features that we consider "alpha" - available for use, but without any guarantee of long-term support or stability.
Hashicorp Vault is a popular secret management solution. We've attempted to support it as an alternative to the secret management solutions provided by AWS/GCP.
see
We've done a PoC for logging from psoxy back to New Relic for monitoring. Usage explanation below:
see
From main
:
follow steps output by that tool
if need interim testing, create a "branch" of the release (eg, branch v0.4.16
instead of tag), and trigger gh workflow run ci-terraform-examples-release.yaml
On rc-
:
QA aws, gcp dev examples by running terraform apply
for each, and testing various connectors.
Scan a GCP container image for vulnerabilities:
Create PR to merge rc-
to main
.
After merged to main
:
With v0.4.47
, we're adding alpha support for AWS Secrets Manager. This feature is not yet fully documented or stable.
A couple notes:
some connectors, in particular Zoom/Jira, rotate tokens frequently so will generate a lot of versions of secrets. AFAIK, AWS will still bill you for just one secret, as only one should be staged as the 'current' version. But you should monitor this and review the particular terms and pricing model of your AWS contract.
our modules will create secrets ONLY in the region into which your proxy infra is being deployed, based on the value set in your terraform.tfvars
file.
Migration from Parameter Store: the default storage for secrets is as AWS Systems Manager Parameter Store SecureString parameters. If you have existing secrets in Parameter Store that aren't managed by terraform, you can copy them to a secure location to avoid needing to re-create them for every source.
If you forked the psoxy-example-aws
repo prior to v0.4.47
, you should copy a main.tf
and variables.tf
from that version or later of the repo and unify the version numbers with your own. (>=0.4.47)
Add the following to your terraform.tfvars
:
If you previously filled any secret values via AWS web console (such as API secrets you were directed to create in TODO 1
files for certain sources), you should copy those values now.
terraform apply
; review plan and confirm when ready
Fill values of secrets in Secrets Manager that you copied from Parameter Store. If you did not copy the values, see the TODO 1..
files for each connector to obtain new values.
Navigate to the AWS Secrets Manager console and find the secret you need to fill. If there's not an option to fill the value, click 'Retrieve secret value'; it should then prompt you with option to fill it.
IMPORTANT: Choose 'Plain Text' and remove the brackets ({}
) that AWS prefills the input with!
Then copy-paste the value EXACTLY as-is. Ensure no leading/trailing whitespace or newlines, and no encoding problems.
AWS Secrets Manager secrets will be stored/accessed with the same path as SSM parameters. Eg, at value of aws_ssm_param_root_path
, if any.
q: support distinct path for secrets? or generalize parameter naming?
As of Nov 10, 2022, Psoxy has added alpha support for using Hashicorp Vault as its secret store (rather than AWS Systems Manager Parameter Store or GCP Secret Manager). We're releasing this as an alpha feature, with potential for breaking changes to be introduced in any future release, including minor releases which should not break production-ready features.
NOTE: you will NOT be able to use the Terraform examples found in infra/examples
; you will have to adapt the modular forms of those found in infra/modular-examples
, swapping the host platform's secret manager for Vault.
Set the following environment variables in your instance:
VAULT_ADDR
- the address of your Vault instance, e.g. https://vault.example.com:8200
NOTE: must be accessible from AWS account / GCP project where you're deploying
VAULT_TOKEN
- choose the appropriate token type for your use case; we recommend you use a periodic token that can lookup and renew itself, with period of > 8 days. With such a setup, Psoxy will look up and renew this token as needed. Otherwise, it's your responsibility to either renew it OR replace it by updating this environment variable before expiration.
VAULT_NAMESPACE
- optional, if you're using Vault Namespaces
PATH_TO_SHARED_CONFIG
- eg, secret/worklytics_deployment/PSOXY_SHARED/
PATH_TO_INSTANCE_CONFIG
- eg, secret/worklytics_deployment/PSOXY_GCAL/
Configure your secrets in Vault. Given the above, Psoxy will connect to Vault in lieu of the usual Secret storage solution for your cloud provider. It will expect config properties (secrets) organized as follows:
global secrets: ${PATH_TO_SHARED_CONFIG}${PROPERTY_NAME}
, eg with PATH_TO_SHARED_CONFIG
- eg, secret/worklytics_deployment/PSOXY_SHARED/
then:
secret/worklytics_deployment/PSOXY_SHARED/PSOXY_SALT
secret/worklytics_deployment/PSOXY_SHARED/PSOXY_ENCRYPTION_KEY
per-connector secrets: ${PATH_TO_INSTANCE_CONFIG}${PROPERTY_NAME}
eg with PATH_TO_INSTANCE_CONFIG
as secret/worklytics_deployment/PSOXY_GCAL/
:
secret/worklytics_deployment/PSOXY_GCAL/RULES
secret/worklytics_deployment/PSOXY_GCAL/ACCESS_TOKEN
secret/worklytics_deployment/PSOXY_GCAL/CLIENT_ID
secret/worklytics_deployment/PSOXY_GCAL/CLIENT_SECRET
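The path layout above can be expressed as a small Python helper, which may be handy as a checklist of the secrets to create in Vault. The function name is hypothetical, and the property names are just the ones from the example above; the actual set varies by connector.

```python
from typing import Mapping

# property names taken from the example above; the real set varies by connector
SHARED_PROPERTIES = ("PSOXY_SALT", "PSOXY_ENCRYPTION_KEY")
INSTANCE_PROPERTIES = ("RULES", "ACCESS_TOKEN", "CLIENT_ID", "CLIENT_SECRET")


def expected_secret_paths(env: Mapping[str, str]) -> list[str]:
    """List the Vault paths implied by the two PATH_TO_* environment variables."""
    shared = env["PATH_TO_SHARED_CONFIG"]
    instance = env["PATH_TO_INSTANCE_CONFIG"]
    # paths are plain concatenations, so each prefix should end with '/'
    paths = [shared + name for name in SHARED_PROPERTIES]
    paths += [instance + name for name in INSTANCE_PROPERTIES]
    return paths
```

Passing `os.environ` (or a plain dict with the two variables set) yields the full list of paths for that instance.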
Ensure ACL permits 'read' and, if necessary, 'write'. Psoxy will need to be able to read secrets from Vault, and in some cases (eg, OAuth tokens subject to refresh) write. Additionally, if you're using a periodic token as recommended, the token must be authorized to lookup and renew itself.
Generally, follow Vault's guide: https://developer.hashicorp.com/vault/docs/auth/aws
We also have a Terraform module you can try to set-up Vault for use from Psoxy:
And another Terraform module to add Vault access for each psoxy instance:
Manually, steps are roughly:
Create IAM policy needed by Vault in your AWS account.
Create IAM User for Vault in your AWS account.
Enable aws
auth method in your Vault instance. Set access key + secret for the vault user created above.
Create a Vault policy to allow access to the necessary secrets in Vault.
Bind a Vault role with same name as your lambda function with lambda's AWS exec role (once for each lambda)
NOTE: pretty certain must be plain role arn, not assumed_role arn even though that's what vault sees
eg arn:aws:iam::{{YOUR_AWS_ACCOUNT_ID}}:role/PsoxyExec_psoxy-gcal
not arn:aws:sts::{{YOUR_AWS_ACCOUNT_ID}}:assumed_role/PsoxyExec_psoxy-gcal/psoxy-gcal
TODO
This section describes many of the pre-configured data sources that can be connected to Worklytics via Psoxy.
The table below enumerates the available connectors, provided via the worklytics-connector-specs
Terraform module (see infra/modules/worklytics-connector-specs).
To add a source, add its Connector ID to the enabled_connectors
list in your terraform.tfvars
file.
Connector ID | Data Source | Type | Availability
asana | Asana | API | GA
azure-ad | Azure Active Directory | API | DEPRECATED
badge | Badge | Bulk | GA
dropbox-business | Dropbox Business | API | DEPRECATED
gcal | Google Calendar | API | GA
gdrive | Google Drive | API | GA
gdirectory | Google Directory | API | GA
github | GitHub | API | GA
github-enterprise-server | GitHub Enterprise Server | API | GA
github-non-enterprise | GitHub Non-Enterprise | API | GA
gmail | Gmail | API | GA
google-chat | Google Chat | API | GA
google-meet | Google Meet | API | GA
hris | HR Information System (eg, Workday) | Bulk | GA
jira-cloud | Jira Cloud | API | GA
jira-server | Jira Server | API | GA
outlook-cal | Outlook Calendar | API | GA
outlook-mail | Outlook Mail | API | GA
msft-teams | Microsoft Teams | API | BETA
msft-entra-id | Microsoft Entra ID | API | GA
qualtrics | Qualtrics (survey) | API | GA
salesforce | Salesforce | API | GA
slack-discovery-api | Slack Discovery API | API | GA
survey | Survey | Bulk | GA
zoom | Zoom | API | GA
From v0.4.58, you can confirm the availability of a connector by running the following command from the root of one of our examples:
If you're using v0.4.58+ of our terraform modules, but don't have the above script in your example, try
Create a Service Account User + token, or a Personal Access Token for a sufficiently privileged user (who can see all the workspaces/teams/projects/tasks you wish to import to Worklytics via this connection).
Update the content of PSOXY_ASANA_ACCESS_TOKEN
variable or ACCESS_TOKEN
environment variable with the token value obtained in the previous step.
NOTE: derived from worklytics-connector-specs; refer to that for definitive information.
Availability: BETA
There are several connectors available for GitHub:
[GitHub Free/Pro/Teams] - for non-Enterprise GitHub organization hosted in github.com.
[GitHub Enterprise Cloud] - GitHub Enterprise instances hosted by github.com on behalf of your organization.
[GitHub Enterprise Server] - similar to 'Cloud', but you must customize rules and API host; contact Worklytics for assistance.
The connector uses a GitHub App to authenticate and access the data.
For Enterprise Server, you must generate a user access token.
For Cloud, including Free/Pro/Teams/Enterprise, you must provide an installation token for authentication.
Both share the same configuration and setup instructions except Administration permission for Audit Log events.
Follow these steps:
Populate github_organization
variable in Terraform with the name of your GitHub organization.
From your organization, register a GitHub App with the following permissions, all Read Only:
Repository:
Contents: for reading commits and comments
Issues: for listing issues, comments, assignees, etc.
Metadata: for listing repositories and branches
Pull requests: for listing pull requests, reviews, comments and commits
Organization
Administration: (Only for GitHub Enterprise) for listing events from audit log
Members: for listing teams and their members
NOTES:
We assume that ALL the repositories to be listed are owned by the organization, not by individual users.
Apart from GitHub instructions please review the following:
"Homepage URL" can be anything, not required in this flow but required by GitHub.
Webhooks check can be disabled as this connector is not using them
Keep Expire user authorization tokens
enabled, as GitHub documentation recommends
Once it is created, please generate a new Private Key
.
The certificate downloaded in the previous step must be converted from PKCS#1 format to PKCS#8. Please run the following command:
NOTES:
If the certificate is not converted to PKCS#8 connector will NOT work. You might see in logs a Java error Invalid PKCS8 data.
if the format is not correct.
Command proposed has been successfully tested on Ubuntu; it may differ for other operating systems.
Install the application in your organization. Go to your organization settings and then in "Developer Settings". Then, click on "Edit" for your "Github App" and once you are in the app settings, click on "Install App" and click on the "Install" button. Accept the permissions to install it in your whole organization.
Once installed, the installationId
is required as it needs to be provided in the proxy as parameter for the connector in your Terraform module. You can go to your organization settings and click on Third Party Access
. Click on Configure
the application you have installed in previous step and you will find the installationId
at the URL of the browser:
Copy the value of installationId
and assign it to the github_installation_id
variable in Terraform. You will need to redeploy the proxy again if that value was not populated before.
NOTE:
If github_installation_id
is not set, authentication URL will not be properly formatted and you will see 401: Unauthorized when trying to get an access token.
If you see 404: Not found in logs please review the IP restriction policies that your organization might have; that could cause connections from psoxy AWS Lambda/GCP Cloud Functions be rejected.
Update the variables with values obtained in previous step:
PSOXY_GITHUB_CLIENT_ID
with App ID
value. NOTE: It should be App Id
value as we are going to use authentication through the App and not client_id.
PSOXY_GITHUB_PRIVATE_KEY
with content of the gh_pk_pkcs8.pem
from previous step. You could open the certificate with VS Code or any other editor and copy all the content as-is into this variable.
Once the certificate has been uploaded, please remove {YOUR DOWNLOADED CERTIFICATE FILE} and gh_pk_pkcs8.pem
from your computer or store it in a safe place.
We provide a helper script to set up the connector, which will guide you through the steps below and automate some of them. Alternatively, you can follow the steps below directly:
You have to populate:
github_enterprise_server_host
variable in Terraform with the hostname of your GitHub Enterprise Server (example: github.your-company.com
). This host should be accessible from the proxy instance function, as the connector will need to reach it.
github_organization
variable in Terraform with the name of your organization in GitHub Enterprise Server. You can specify more than one; just separate them with commas (example: org1,org2
).
From your organization, register a GitHub App with the following permissions, all Read Only:
Repository:
Contents: for reading commits and comments
Issues: for listing issues, comments, assignees, etc.
Metadata: for listing repositories and branches
Pull requests: for listing pull requests, reviews, comments and commits
Organization
Administration: for listing events from audit log
Members: for listing teams and their members
NOTES:
We assume that ALL the repositories to be listed are owned by the organization, not by individual users.
Apart from GitHub instructions please review the following:
"Homepage URL" can be anything, not required in this flow but required by GitHub.
"Callback URL" can be anything, but we recommend something like http://localhost
as we will need it for the redirect as part of the authentication.
Webhooks check can be disabled as this connector is not using them
Keep Expire user authorization tokens
enabled, as GitHub documentation recommends
Once it is created, please generate a new Client Secret
.
Copy the Client ID
and copy in your browser following URL, replacing the CLIENT_ID
with the value you have just copied:
The browser will ask you to accept permissions and then redirect you to the Callback URL
set as part of the application. The URL should look like this: https://localhost/?code=69d0f5bd0d82282b9a11
.
Copy the value of code
and run the following URL replacing in the placeholders the values of Client ID
and Client Secret
:
The response will be something like:
You will need to copy the value of the refresh_token
.
NOTES:
Code
can be used once, so if you need to repeat the process you will need to generate a new one.
Update the variables with values obtained in previous step:
psoxy_GITHUB_ENTERPRISE_SERVER_CLIENT_ID
with Client Id
value.
psoxy_GITHUB_ENTERPRISE_SERVER_CLIENT_SECRET
with Client Secret
value.
psoxy_GITHUB_ENTERPRISE_SERVER_REFRESH_TOKEN
with the refresh_token
.
These instructions have been derived from worklytics-connector-specs; refer to that for definitive information.
The Dropbox Business connector through Psoxy requires a Dropbox Application created in Dropbox Console. The application does not require to be public, and it needs to have the following scopes to support all the operations for the connector:
files.metadata.read
: for file listing and revision
members.read
: member listing
events.read
: event listing
groups.read
: group listing
Go to https://www.dropbox.com/apps and Build an App
Then go to https://www.dropbox.com/developers to enter the App Console
to configure your app
Now you are in the app, go to Permissions
and mark all the scopes described above. NOTE: the UI will probably mark additional required permissions automatically (like account_info_read). Just mark the ones described here and the UI will ask you to include any others that are required.
In Settings, you can find the App key
and App secret
. You can create an access token here, but with limited expiration. We need to create a long-lived token, so edit the following URL with your App key
and paste it into the browser:
https://www.dropbox.com/oauth2/authorize?client_id=<APP_KEY>&token_access_type=offline&response_type=code
That will return an Authorization Code
that you have to paste. NOTE This Authorization Code
is for a single use; if it has expired or been used, you will need to obtain a new one by pasting the URL in the browser again.
Now, replace the values in following URL and run it from command line in your terminal. Replace Authorization Code
, App key
and App secret
in the placeholders:
curl https://api.dropbox.com/oauth2/token -d code=<AUTHORIZATION_CODE> -d grant_type=authorization_code -u <APP_KEY>:<APP_SECRET>
After running that command, if successful you will see a JSON response like this:
Finally, set the following variables in AWS Systems Manager Parameter Store / GCP Secret Manager (if default implementation):
PSOXY_dropbox_business_REFRESH_TOKEN
secret variable with value of refresh_token
received in previous response
PSOXY_dropbox_business_CLIENT_ID
with App key
value.
PSOXY_dropbox_business_CLIENT_SECRET
with App secret
value.
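For reference, the authorization URL and token request used in the steps above can be assembled with Python's standard library as sketched below. The function names are hypothetical; the endpoints and parameters are the ones shown in the steps, and nothing is sent over the network until the request is passed to `urllib.request.urlopen`.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def authorize_url(app_key: str) -> str:
    """Build the browser URL that yields a single-use Authorization Code."""
    query = urlencode(
        {"client_id": app_key, "token_access_type": "offline", "response_type": "code"}
    )
    return f"https://www.dropbox.com/oauth2/authorize?{query}"


def token_request(app_key: str, app_secret: str, authorization_code: str) -> Request:
    """Build the POST that exchanges the code for tokens (equivalent to the curl call)."""
    body = urlencode(
        {"code": authorization_code, "grant_type": "authorization_code"}
    ).encode("ascii")
    # HTTP Basic auth with the app key and secret, as with curl's -u flag
    credentials = base64.b64encode(f"{app_key}:{app_secret}".encode("utf-8")).decode("ascii")
    return Request(
        "https://api.dropbox.com/oauth2/token",
        data=body,
        headers={"Authorization": f"Basic {credentials}"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(token_request(...))` should return the JSON response containing the `refresh_token`, provided the Authorization Code has not already been used.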
Example commands (*) that you can use to validate proxy behavior against the Google Workspace APIs. Follow the steps and change the values to match your configuration when needed.
You can use the -i
flag to impersonate the desired user identity option when running the testing tool. Example:
For AWS, change the role to assume with one with sufficient permissions to call the proxy (-r
flag). Example:
If any call appears to fail, repeat it using the -v
flag.
(*) All commands assume that you are at the root path of the Psoxy project.
Get the calendar event ID (accessor path in response .items[0].id
):
Get event information (replace calendar_event_id
with the corresponding value):
Get the group ID (accessor path in response .groups[0].id
):
Get group information (replace google_group_id
with the corresponding value):
Get the user ID (accessor path in response .users[0].id
):
Get user information (replace [google_user_id] with the corresponding value):
Thumbnail (expect its contents to be redacted; replace [google_user_id] with the corresponding value):
API v2
API v3 (*)
(*) Notice that only the "version" part of the URL changes, and all subsequent calls should work for v2
and also v3
.
Get the file ID (accessor path in response .files[0].id
):
Get file details (replace [drive_file_id] with the corresponding value):
YMMV, as file at index 0
must actually be a type that supports revisions for this to return anything. You can play with different file IDs until you find something that does.
YMMV, as file at index 0
must actually be a type that has comments for this to return anything. You can play with different file IDs until you find something that does.
NOTE probably blocked by OAuth metadata only scope!!
NOTE probably blocked by OAuth metadata only scope!!
Get file comment ID (accessor path in response .items[0].id
):
Get file comment details (replace file_comment_id
with the corresponding value):
NOTE probably blocked by OAuth metadata only scope!!
YMMV, as above, play with the file comment ID value until you find a file with comments, and a comment that has replies.
NOTE: limited to 10 results, to keep it readable.
NOTE: limited to 10 results, to keep it readable.
As of May 2023, Atlassian has announced they will stop supporting Jira Server on Feb 15, 2024. Our Jira Server connector is intended to be compatible with Jira Data Center as well.
NOTE: as of Nov 2023, organizations are making production use of this connector; we've left it as alpha due to impending obsolescence of Jira Server.
NOTE: derived from worklytics-connector-specs; refer to that for definitive information.
Follow the instructions to create a Personal Access Token in your instance. As this is coupled to a specific User in Jira, we recommend first creating a dedicated Jira user to be a "Service Account" in effect for the connection (name it svc-worklytics
or something). This will give you better visibility into activity of the data connector as well as avoid connection inadvertently breaking if the Jira user who owns the token is disabled or deleted.
That service account must have READ permissions over your Jira instance, to be able to read issues, worklogs and comments, including their changelog where possible.
If you're required to specify a classical scope, you can add:
read:jira-work
Disable or set a reasonable expiration time for the token. If you set an expiration time, it is your responsibility to re-generate the token and reset it in your host environment to maintain your connection.
Copy the value of the token in PSOXY_JIRA_SERVER_ACCESS_TOKEN
variable as part of AWS Systems Manager Parameter Store / GCP Secret Manager.
For customers wishing to use New Relic to monitor their proxy instances, we have alpha support for this in AWS. We provide no guarantee as to how it works, nor as to whether its behavior will be maintained in the future.
To enable,
Set your proxy release to v0.4.39.alpha.new-relic.1
.
Add the following to your terraform.tfvars
to configure it:
(if you already have a defined general_environment_variables
variable, just add the NEW_RELIC_
variables to it)
Google Workspace sources can be setup via Terraform, using modules found in our GitHub repo.
As of August 2023, we suggest you use one of our template repos, eg:
Within those, the google-workspace.tf
and google-workspace-variables.tf
files specify the terraform configuration to use Google Workspace sources.
You (the user running Terraform) must have the following roles (or some of the permissions within them) in the GCP project in which you will provision the OAuth clients that will be used to connect to your Google Workspace data:
create Service Accounts to be used as API clients
to access Google Workspace API, proxy must be authenticated by a key that you need to create
you will need to enable the Google Workspace APIs in your GCP Project
As these are very permissive roles, we recommend that you use a dedicated GCP project so that these roles are scoped just to the Service Accounts used for this deployment. If you used a shared GCP project, these roles would give you access to create keys for ALL the service accounts in the project, for example - which is not good practice.
Additionally, a Google Workspace Admin will need to make a Domain-wide Delegation grant to the OAuth Clients you create. This is done via the Google Workspace Admin console. In the default setup, this requires the Super Admin role, but your organization may have a Custom Role with sufficient privileges.
We also recommend you create a dedicated Google Workspace user for Psoxy to use when connecting to your Google Workspace Admin API, with the specific permissions needed. This avoids the connection being dependent on a given human user's permissions and improves transparency.
This is not to be confused with a GCP Service Account. Rather, this is a regular Google Workspace user account, but intended to be assigned to a service rather than a human user. Your proxy instance will impersonate this user when accessing the Google Admin Directory and Reports APIs. (Google requires that these be accessed via impersonation of a Google user account, rather than directly using a GCP service account).
We recommend naming the account svc-worklytics@{your-domain.com}
.
If you have already created a sufficiently privileged service account user for a different Google Workspace connection, you can re-use that one.
Assign the account a sufficiently privileged role. At minimum, the role must have the following privileges, read-only:
Admin API
Domain Settings
Groups
Organizational Units
Reports (required only if you are connecting to the Audit Logs, used for Google Chat, Meet, etc)
Users
Those refer to Google's documentation, as shown below (as of Aug 2023); you can refer there for more details about these privileges.
The email address of the account you created will be used when creating the data connection to the Google Directory in the Worklytics portal. Provide it as the value of the 'Google Account to Use for Connection' setting when you create the connection.
If you choose not to use a predefined role that covers the above, you can define a Custom Role.
Using a Custom Role, with 'Read' access to each of the required Admin API privileges is good practice, but least-privilege is also enforced in TWO additional ways:
the Proxy API rules restrict the API endpoints that Worklytics can access, as well as the HTTP methods that may be used. This enforces read-only access, limited to the required data types (and actually even more granular than what Workspace Admin privileges and OAuth Scopes support).
the OAuth Scopes granted to the API client via Domain-wide delegation. Each OAuth Client used by Worklytics is granted only read-only scopes, least-permissive for the data types required. eg https://www.googleapis.com/auth/admin.directory.user.readonly
.
So a least-privileged custom role is essentially a 3rd layer of enforcement.
In the Google Workspace Admin Console as of August 2023, creating a 'Custom Role' for this user will look something like the following:
YMMV - Google's UI changes frequently and varies by Google Workspace edition, so you may see more or fewer options than shown above. Please scroll the list of privileges to ensure you grant READ access to API for all of the required data.
Google Workspace APIs use OAuth 2.0 for authentication and authorization. You create an OAuth 2.0 client in Google Cloud Platform and a credential (service account key), which you store as a secret in your Proxy instance.
When the proxy connects to Google, it first authenticates with the Google API using this secret (a service account key) by signing a request for a short-lived access token. Google returns this access token, which the proxy then uses for subsequent requests to Google's APIs until the token expires.
The service account key can be rotated at any time, and the terraform configuration examples we provide can be configured to do this for you if applied regularly.
More information: https://developers.google.com/workspace/guides/auth-overview
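As a rough illustration of the token request described above, the sketch below builds the claim set that a service account signs when asking Google for a short-lived access token. The claim names (iss, scope, aud, iat, exp, sub) follow Google's documented service-account OAuth 2.0 flow; the function name, the example addresses, and the one-hour lifetime are assumptions for illustration. The RS256 signing step is deliberately left out, since it needs the private key and a crypto library, and the proxy's real implementation may differ.

```python
import time

GOOGLE_TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_token_request_claims(service_account_email, scopes, impersonated_user=None,
                               now=None, lifetime_seconds=3600):
    """Build the JWT claim set for a service-account access token request.

    Hypothetical helper for illustration only.
    """
    issued_at = int(now if now is not None else time.time())
    claims = {
        "iss": service_account_email,   # the OAuth client (service account)
        "scope": " ".join(scopes),      # scopes are space-delimited in the JWT
        "aud": GOOGLE_TOKEN_URL,
        "iat": issued_at,
        "exp": issued_at + lifetime_seconds,
    }
    if impersonated_user:
        # With Domain-wide Delegation, 'sub' names the Workspace user to impersonate.
        claims["sub"] = impersonated_user
    return claims


# Example values are placeholders, not real accounts.
claims = build_token_request_claims(
    "psoxy-gdirectory@example-project.iam.gserviceaccount.com",
    ["https://www.googleapis.com/auth/admin.directory.user.readonly"],
    impersonated_user="svc-worklytics@example.com",
    now=1_700_000_000,
)
print(claims)
```

The signed JWT is what gets exchanged for the access token; the service account key itself is never sent to Google.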
To initially authorize each connector, a sufficiently privileged Google Workspace Admin must make a Domain-wide Delegation grant to the OAuth Client you create, by pasting its numeric ID and a CSV of the required OAuth Scopes into the Google Workspace Admin console. This is a one-time setup step.
If you use the provided Terraform modules (namely, google-workspace-dwd-connection
), a TODO file with detailed instructions will be created for you, including the actual numeric ID and scopes required.
Note that while Domain-wide Delegation is a broad grant of data access, its implementation in the proxy is mitigated in several ways because the GCP Service Account resides in your own GCP project and remains under your organization's control - unlike the most common Domain-wide Delegation scenarios, which have been the subject of criticism by security researchers. In particular:
you may directly verify the numeric ID of the service account in the GCP web console, or via the GCP CLI; you don't need to take our word for it.
you may monitor and log the use of each service account and its key as you see fit.
you can ensure there is never more than one active key for each service account, and rotate keys at any time.
the key is only used from infrastructure (GCP Cloud Function or Lambda) in your environment; you should be able to reconcile logs and usage between your GCP and AWS environments should you desire to ensure there has been no malicious use of the key.
While not recommended, it is possible to set up Google API clients without Terraform, via the GCP web console.
Create or choose the GCP project in which to create the OAuth Clients.
Activate relevant API(s) in the project.
Create a Service Account in the project; this will be the OAuth Client.
Get the numeric ID of the service account. Use this plus the oauth scopes to make domain-wide delegation grants via the Google Workspace admin console.
Then follow the steps in the next section to create the keys for the Oauth Clients.
NOTE: if you are creating connections to multiple Google Workspace sources, you can use a single OAuth client and share it between all the proxy instances. You just need to authorize the entire superset of OAuth scopes required by those connections for the OAuth Client via the Google Workspace Admin console.
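To make the "superset of scopes" idea concrete, here is a minimal sketch that merges the scopes of several connections into the single comma-separated value pasted into the Admin console. The connection names and the scope lists are illustrative assumptions; take the real lists from each connector's documentation.

```python
def dwd_scope_csv(scopes_by_connection):
    """Return the de-duplicated union of all scopes as one comma-separated string."""
    superset = sorted({scope
                       for scopes in scopes_by_connection.values()
                       for scope in scopes})
    return ",".join(superset)


# Illustrative scope lists only; not the definitive scopes for any connector.
scopes_by_connection = {
    "gdirectory": [
        "https://www.googleapis.com/auth/admin.directory.user.readonly",
        "https://www.googleapis.com/auth/admin.directory.group.readonly",
    ],
    "gcal": [
        "https://www.googleapis.com/auth/admin.directory.user.readonly",
        "https://www.googleapis.com/auth/calendar.readonly",
    ],
}

scope_csv = dwd_scope_csv(scopes_by_connection)
print(scope_csv)
```

Sorting is only there to keep the output stable between runs; the Admin console does not care about order.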
If your organization's policies don't allow GCP service account keys to be managed via Terraform (or you lack the perms to do so), you can still use our Terraform modules to create the clients, and just add the following to your terraform.tfvars
to disable provisioning of the keys:
Then you can create the keys manually, and store them in your secrets manager of choice.
For each API client you need to:
Create a JSON key for the service account (via GCP console or CLI)
Base64-encode the key; eg cat service-account.json | base64 | pbcopy
Store it as a secret; the name should be something like PSOXY_GDIRECTORY_SERVICE_ACCOUNT_KEY
. Our Terraform modules should still create an instance of the secret in your host environment, just filled with a placeholder value.
For GCP Secret Manager, you can do (3) via CLI as follows: pbpaste | gcloud secrets versions add PSOXY_GCAL_SERVICE_ACCOUNT_KEY --data-file=- --project=YOUR_PROJECT_ID
For AWS Systems Manager Parameter Store, you can do (3) via CLI as follows: aws ssm put-parameter --name PSOXY_GCAL_SERVICE_ACCOUNT_KEY --type SecureString --value "$(pbpaste)" --overwrite --region us-east-1
(NOTE: please refer to aws/gcloud docs for exact versions of commands above; YMMV, as this is not our recommended approach for managing keys)
If you are sharing a single OAuth client between multiple proxy instances, you just repeat step (3) for EACH client. (eg, store N copies of the key, all with the same value)
Whenever you want to rotate the key (which GCP recommends at least every 90 days), you must repeat the steps in this section (no need to create Service Account again; just create a new key for it and put the new version into Secrets Manager).
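If you prefer to script the Base64 step rather than pipe through the base64 command, a sketch along these lines should be equivalent. The function name and the sample key content are made up for illustration; a real key file contains the private key and must be handled as a secret.

```python
import base64
import json


def encode_service_account_key(key_json_text):
    """Validate that the text is JSON, then return it Base64-encoded."""
    json.loads(key_json_text)  # fail early if the file is not valid JSON
    return base64.b64encode(key_json_text.encode("utf-8")).decode("ascii")


# Placeholder content standing in for the downloaded service-account.json file.
sample_key = json.dumps({"type": "service_account", "client_id": "000000000000000000000"})
encoded = encode_service_account_key(sample_key)
print(encoded)
```

Note that b64encode emits a single line with no wrapping, whereas some builds of the base64 command insert line breaks by default.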
If you remain uncomfortable with Domain-wide Delegation, a private Google Marketplace App is a possible, if tedious and harder to maintain, alternative. Here are some trade-offs:
Pros:
Google Workspace Admins may perform a single Marketplace installation, instead of multiple DWD grants via the admin console
"install" from the Google Workspace Marketplace is less error-prone/exploitable than copy-pasting a numeric service account ID
visual confirmation of the oauth scopes being granted by the install
ability to "install" for an Org Unit, rather than the entire domain
Cons:
you must use a dedicated GCP project for the Marketplace App; "installation" of a Google Marketplace App grants all the service accounts in the project access to the listed oauth scopes. You must understand that the OAuth grant is to the project, not a specific service account.
you must enable additional APIs in the GCP project (marketplace SDK).
as of Dec 2023, Marketplace Apps cannot be completely managed by Terraform resources; so there are more out-of-band steps that someone must complete by hand to create the App; and a simple terraform destroy
will not remove the associated infrastructure. In contrast, terraform destroy
in the DWD approach will result in revocation of the access grants when the service account is deleted.
You must monitor how many service accounts exist in the project and ensure only the expected ones are created. Note that all Google Workspace API access, as of Dec 2023, requires the service account to authenticate with a key; so any SA without a key provisioned cannot access your data.
NOTE: This is for the Cloud-hosted version of Jira; for the self-hosted version, see Jira Server.
NOTE: These instructions are derived from worklytics-connector-specs; refer to that for definitive information.
Jira Cloud through Psoxy uses Jira OAuth 2.0 (3LO), which requires a Jira Cloud (user) account with the following classic scopes:
read:jira-user
: for getting generic user information
read:jira-work
: for getting information about issues, comments, etc
And following granular scopes:
read:account
: for getting user emails
read:group:jira
: for retrieving group members
read:avatar:jira
: for retrieving group members
You will need a web browser and a terminal with curl
available (such as macOS terminal, Linux, an AWS CloudShell, Windows Subsystem for Linux, etc)
Go to https://developer.atlassian.com/console/myapps/ and click on "Create" and choose "OAuth 2.0 Integration"
Then click "Authorization" and "Add" on OAuth 2.0 (3LO)
, adding http://localhost
as callback URI. It can be any validly formatted URL; the field must be populated, but the proxy instance workflow will not use it.
Now navigate to "Permissions" and click on "Add" for Jira API
. Once added, click on "Configure". Add the following scopes as part of "Classic Scopes", first clicking on Edit Scopes
and then selecting them:
read:jira-user
read:jira-work
And these from "Granular Scopes":
read:group:jira
read:avatar:jira
read:user:jira
Then go back to "Permissions" and click on "Add" for User Identity API
, selecting only the following scope:
read:account
After adding all the scopes, you should have 1 permission for User Identity API
and 5 for Jira API
:
Once configured, go to "Settings" and copy the "Client Id" and "Secret". You will use these to obtain an OAuth refresh_token
.
Build an OAuth authorization endpoint URL by copying the value for "Client Id" obtained in the previous step into the URL below. Then open the result in a web browser:
https://auth.atlassian.com/authorize?audience=api.atlassian.com&client_id=<CLIENT ID>&scope=offline_access%20read:group:jira%20read:avatar:jira%20read:user:jira%20read:account%20read:jira-user%20read:jira-work&redirect_uri=http://localhost&state=YOUR_USER_BOUND_VALUE&response_type=code&prompt=consent
6. Choose a site in your Jira workspace to allow access for this application and click "Accept". As the callback does not exist, you will see an error; but the address bar of your browser will show a URL like this:
http://localhost/?state=YOUR_USER_BOUND_VALUE&code=eyJhbGc...
Copy the value of the code
parameter from that URI. It is the "authorization code" required for next step.
NOTE: This "Authorization Code" is single-use; if it expires or is used, you will need to obtain a new code by pasting the authorization URL into the browser again.
Now, fill in the following command and run it from the command line in your terminal. Replace the placeholders YOUR_AUTHENTICATION_CODE
, YOUR_CLIENT_ID
and YOUR_CLIENT_SECRET
with your values:
curl --request POST --url 'https://auth.atlassian.com/oauth/token' --header 'Content-Type: application/json' --data '{"grant_type": "authorization_code","client_id": "YOUR_CLIENT_ID","client_secret": "YOUR_CLIENT_SECRET", "code": "YOUR_AUTHENTICATION_CODE", "redirect_uri": "http://localhost"}'
After running that command, if successful you will see a JSON response like this:
Set the following variables in AWS Systems Manager Parameter Store / GCP Cloud Secrets (if default implementation):
PSOXY_JIRA_CLOUD_ACCESS_TOKEN
secret variable with value of access_token
received in previous response
PSOXY_JIRA_CLOUD_REFRESH_TOKEN
secret variable with value of refresh_token
received in previous response
PSOXY_JIRA_CLOUD_CLIENT_ID
with Client Id
value.
PSOXY_JIRA_CLOUD_CLIENT_SECRET
with Client Secret
value.
Obtain the "Cloud ID" of your Jira instance. Use the following command, with the access_token
obtained in the previous step in place of <ACCESS_TOKEN>
below:
curl --header 'Authorization: Bearer <ACCESS_TOKEN>' --url 'https://api.atlassian.com/oauth/token/accessible-resources'
And its response will be something like:
In your Terraform configuration's terraform.tfvars
file, set the jira_cloud_id
variable to the id
value from the JSON response. This will ensure that all test URLs are generated with the correct value, targeting a valid Jira Cloud instance.
NOTE: A "token family" includes the initial access/refresh tokens generated above as well as all subsequent access/refresh tokens that Jira returns to any future token refresh requests. By default, Jira enforces a maximum lifetime of 1 year for each token family. So you MUST repeat steps 5-9 at least annually or your proxy instance will stop working when the token family expires.
The Psoxy HRIS (human resource information system) connector is intended to sanitize data exported from an HRIS/HCM system which you intend to transfer to Worklytics. The expected format is a CSV file, as defined in the documentation for import data (obtain from Worklytics).
See: https://docs.worklytics.co/knowledge-base/connectors/bulk-data/hris-snapshots
The default proxy rules for hris
will pseudonymize EMPLOYEE_ID
, EMPLOYEE_EMAIL
, MANAGER_ID
- as well as a MANAGER_EMAIL
column if it's included.
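To give a feel for what pseudonymizing these columns means, here is a simplified sketch that replaces the values of the four columns with salted SHA-256 digests and leaves every other column alone. This is not Psoxy's actual algorithm or output format; the salt handling, the lower-casing, the DEPARTMENT column, and the function names are all assumptions made for illustration.

```python
import csv
import hashlib
import io

PSEUDONYMIZED_COLUMNS = {"EMPLOYEE_ID", "EMPLOYEE_EMAIL", "MANAGER_ID", "MANAGER_EMAIL"}


def pseudonymize(value, salt):
    """Return a salted SHA-256 hex digest; empty or missing values pass through."""
    if not value:
        return value
    normalized = value.strip().lower()  # assumption: case-insensitive identifiers
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def sanitize_hris_csv(csv_text, salt):
    """Pseudonymize the identifier columns of an HRIS CSV, keeping the rest as-is."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column in PSEUDONYMIZED_COLUMNS & set(row):
            row[column] = pseudonymize(row[column], salt)
        writer.writerow(row)
    return output.getvalue()


sample = (
    "EMPLOYEE_ID,EMPLOYEE_EMAIL,MANAGER_ID,DEPARTMENT\n"
    "1,alice@example.com,2,Sales\n"
)
sanitized = sanitize_hris_csv(sample, salt="example-salt")
print(sanitized)
```

Because the digest is deterministic for a given salt, the same employee maps to the same pseudonym across files, which is what keeps joins possible after sanitization.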
If your HRIS data does not match the expected schema above, you can customize the proxy rules to perform some basic ETL-like transforms on the data within the proxy itself:
Connecting to Microsoft 365 data requires:
creating one Microsoft Entra ID (formerly Azure Active Directory, AAD) application per Microsoft 365 data source (eg, msft-entra-id
, outlook-mail
, outlook-cal
, etc).
configuring an authentication mechanism to permit each proxy instance to authenticate with the Microsoft Graph API. (since Sept 2022, the supported approach is federated identity credentials)
granting admin consent to each Entra ID enterprise application to specific scopes of Microsoft 365 data the connection requires.
Steps (1) and (2) are handled by the terraform
examples. To perform them, the machine running terraform
must be authenticated with Azure CLI as a Microsoft Entra ID user with, at minimum, the following role in your Microsoft 365 tenant:
Cloud Application Administrator to create/update/delete Entra ID applications and their settings during the Terraform apply command.
Please note that this role is the least-privileged role sufficient for this task (creating a Microsoft Entra ID Application), per Microsoft's documentation. See Least privileged roles by task in Microsoft Entra ID.
This role is needed ONLY for the initial terraform apply
. After each Azure AD enterprise application is created, the user will be set as the owner
of that application, providing ongoing access to read and update the application's settings. At that point, the general role can be removed.
Step (3) is performed via the Microsoft Entra ID web console by a user with administrator permissions. Running the terraform
examples for steps (1)/(2) will generate a document with specific instructions for this administrator. This administrator must have, at minimum, the following role in your Microsoft 365 tenant:
Privileged Role Administrator to Consent to application permissions to Microsoft Graph
Again, this is the least-privileged role sufficient for this task, per Microsoft's documentation. See Least privileged roles by task in Microsoft Entra ID.
Psoxy uses Federated Identity Credentials to authenticate with the Microsoft Graph API. This approach avoids the need for any secrets to be exchanged between your Psoxy instances and your Microsoft 365 tenant. Rather, each API request from the proxy to Microsoft Graph API is signed by an identity credential generated in your host cloud platform. You configure your Azure AD application for each connection to trust this identity credential as identifying the application, and Microsoft trusts your host cloud platform (AWS/GCP) as an external identity provider of those credentials.
Neither your proxy instances nor Worklytics ever hold any API key or certificate for your Microsoft 365 tenant.
See Microsoft Workload Identity Federation docs for details. Specifically, the relevant scenario is workload running in either GCP or AWS (your proxy host platform)
The video below explains the general idea for identity federation for Azure AD-gated resources more generally, of which your Graph API is an example:
The following Scopes are required for each connector. Note that they are all READ-only scopes.
Entra ID
Calendar
Teams (beta)
NOTE: the above scopes are copied from infra/modules/worklytics-connector-specs. They are accurate as of 2023-04-12. Please refer to that module for a definitive list.
NOTE: Mail.ReadBasic
affords only access to email metadata, not content/attachments.
NOTE: These are all 'Application' scopes, allowing the proxy itself data access as an application, rather than on behalf of a specific authenticated end-user ('Delegated' scopes).
If you do not have the 'Cloud Application Administrator' role, someone with that or an alternative role that can create Azure AD applications can create one application per connection and set you as an owner of each.
You can then import
these into your Terraform configuration.
First, try terraform plan | grep 'azuread_application'
to get the Terraform addresses for each application that your configuration will create.
Second, ask your Microsoft admin to create an application for each of those, set you as the owner, and send you the Object ID
for each.
Third, use terraform import <address> <object-id>
to import each application into your Terraform state.
At that point, you can run terraform apply
and it should be able to update the applications with the settings necessary for the proxy to connect to Microsoft Graph API. After that apply, you will still need a Microsoft 365 admin to perform the admin consent step for each application.
See https://registry.terraform.io/providers/hashicorp/azuread/latest/docs/resources/application#import for details.
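When there are several applications to import, generating the commands from a mapping avoids copy-paste slips. The sketch below only prints the terraform import commands for review; the Terraform address and Object ID shown are invented placeholders, and you should check the provider documentation linked above for the exact ID format your provider version expects.

```python
import shlex


def terraform_import_commands(object_ids_by_address):
    """Build one shell-safe 'terraform import' command per application."""
    return [
        f"terraform import {shlex.quote(address)} {shlex.quote(object_id)}"
        for address, object_id in object_ids_by_address.items()
    ]


# Hypothetical address (from 'terraform plan') and Object ID (from your admin).
object_ids_by_address = {
    'module.example.azuread_application.connector["outlook-cal"]':
        "00000000-0000-0000-0000-000000000000",
}

commands = terraform_import_commands(object_ids_by_address)
for command in commands:
    print(command)
```

Quoting matters here because Terraform addresses with map keys contain brackets and double quotes that a shell would otherwise interpret.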
DEPRECATED - will be removed in v0.5; this is not the recommended approach, for a variety of reasons, since Microsoft released support for federated credentials in ~Sept 2022. See our module azuread-federated-credentials
for the preferred alternative.
Psoxy's terraform modules create certificates on your machine, and deploy these to Azure and the keys to your AWS/GCP host environment. This all works via APIs.
Sometimes Azure is a bit finicky about certificate validity dates, and you get an error message like this:
Just running terraform apply
again (and maybe again) usually fixes it. Likely it's something with Azure's clock relative to your machine, plus whatever flight time is required between cert generation and it being PUT to Azure.
Connect Outlook Calendar data to Worklytics, enabling meeting analysis and general collaboration insights based on collaboration via Outlook Calendar. Includes user enumeration to support fetching calendars from each account; and group enumeration to expand attendance/invitations to meetings via mailing list (groups).
Please review the Microsoft 365 README for general information applicable to all Microsoft 365 connectors.
See the Microsoft 365 Authentication section of the main README.
See the Microsoft 365 Authorization section of the main README.
/v1.0/me/events
/v1.0/me/events/{eventId}
/v1.0/me/calendar/calendarView
/v1.0/me/calendar/events
Assuming proxy is auth'd as an application, you'll have to replace me
with your MSFT ID or UserPrincipalName
(often your email address).
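The substitution described above is mechanical, as this small sketch shows: Microsoft Graph accepts /users/{id or userPrincipalName}/... wherever a delegated call would use /me/.... The helper name and the example address are assumptions for illustration.

```python
from urllib.parse import quote


def as_application_path(path, user):
    """Rewrite a '/v1.0/me/...' path to target a specific user instead."""
    prefix = "/v1.0/me/"
    if not path.startswith(prefix):
        return path  # nothing to rewrite
    return f"/v1.0/users/{quote(user, safe='@')}/" + path[len(prefix):]


rewritten = as_application_path("/v1.0/me/calendar/calendarView", "alice@example.com")
print(rewritten)
```

Either the account's object ID or its UserPrincipalName can be passed as the user argument.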
See more examples in the docs/sources/microsoft-365/msft-teams/example-api-responses
folder of the Psoxy repository.
Connect Microsoft Teams data to Worklytics, enabling communication analysis and general collaboration insights based on collaboration via Microsoft Teams. Includes user enumeration to support fetching mailboxes from each account; and group enumeration to expand emails via mailing list (groups).
Please review the Microsoft 365 README for general information applicable to all Microsoft 365 connectors.
See the Microsoft 365 Authentication section of the main README.
See the Microsoft 365 Authorization section of the main README.
Besides having the OnlineMeetings.Read.All
and OnlineMeetingArtifact.Read.All
scopes defined in the application, you need to allow a new role and a policy on the application created for reading OnlineMeetings. You will need PowerShell for this.
Please follow the steps below:
Ensure the user you are going to use for running the commands has the "Teams Administrator" role. You can add the role in the Microsoft 365 Admin Center
NOTE: It can be assigned through the Entra ID portal in the Azure portal OR in the Entra Admin center https://admin.microsoft.com/AdminPortal/Home. It is possible that, even when logged in with an admin account in the Entra Admin center, the Teams role is not available to assign to any user; if so, please assign it through the Azure Portal (Entra ID -> Users -> Assign roles)
Install PowerShell Teams module.
Run the following commands in a PowerShell terminal:
And log in as the user with the "Teams Administrator" role.
Follow steps on Configure application access to online meetings or virtual events:
Add a policy for the application created for the connector, providing its application id
Grant the policy to the whole tenant (NOT to any specific application or user)
Issues:
If you receive "access denied", it is because no admin role for Teams has been detected. Please close and reopen the PowerShell terminal after assigning the role.
Commands have been tested in a PowerShell (7.4.0) terminal on Windows, installed from the Microsoft Store, with the Teams module (5.8.0). They might not work in a different environment.
/v1.0/teams
/v1.0/teams/{teamId}/allChannels
/v1.0/teams/{teamId}/channels/{channelId}/messages
/v1.0/users/{userId}/chats
/v1.0/users/{userId}/onlineMeetings
/v1.0/users/{userId}/onlineMeetings/{meetingId}/attendanceReport/{reportId}
See more examples in the docs/sources/microsoft-365/msft-teams/example-api-responses
folder
NOTE for pseudonymizing app ids
If pseudonymize_app_ids
is set to true
, the userId
and chatId
fields will be tokenized. In that case, if you want to populate example variables like example_msft_user_guid
or example_msft_chat_guid
in the example responses, you will first need to get a list of users and use the id
in the variable. Using a plain user id without tokenization might not work on endpoints that require a tokenized user id.
Example Data:
See more examples in the docs/sources/slack/example-api-responses
folder of the Psoxy repository.
To enable Slack Discovery with Psoxy, you must first set up an app on your Slack Enterprise instance.
Go to https://api.slack.com/apps and create an app.
Select "From scratch", choose a name (for example "Worklytics connector") and a development workspace
Take note of your App ID (listed in "App Credentials"), contact your Slack representative and ask them to enable discovery:read
scope for that App ID. If they also enable discovery:write
then delete it for safety; the app needs only read access.
The next step depends on your installation approach, so you may need to adapt it slightly.
Use this step if you want to install in the whole org, across multiple workspaces.
Add a bot scope (not really used, but Slack doesn't allow org-wide installations without a bot scope). The app won't use it at all. Just add for example the users:read
scope, read-only.
Under "Settings > Manage Distribution > Enable Org-Wide App installation", click on "Opt into Org Level Apps", agree and continue. This allows you to distribute the app internally within your organization; to be clear, it has nothing to do with public distribution or the Slack app directory.
Generate the following URL, replacing the placeholder YOUR_CLIENT_ID, and save it for the next step:
https://api.slack.com/api/oauth.v2.access?client_id=YOUR_CLIENT_ID
Go to "OAuth & Permissions" and add the previous URL as "Redirect URLs"
Go to "Settings > Install App", and choose "Install to Organization". A Slack admin should grant the app the permissions and the app will be installed.
Copy the "User OAuth Token" (also listed under "OAuth & Permissions") and store as PSOXY_SLACK_DISCOVERY_API_ACCESS_TOKEN
in the psoxy's Secret Manager. Otherwise, share the token with the AWS/GCP administrator completing the implementation.
Use these steps if you intend to install in just one workspace within your org.
Go to "Settings > Install App", click on "Install into workspace"
Copy the "User OAuth Token" (also listed under "OAuth & Permissions") and store as PSOXY_SLACK_DISCOVERY_API_ACCESS_TOKEN
in the psoxy's Secret Manager. Otherwise, share the token with the AWS/GCP administrator completing the implementation.
For clarity, example files are NOT compressed, so don't have .gz
extension; but rules expect .gz
.
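If you want to exercise the rules against an example file, it first needs to be gzip-compressed so that its name ends in .gz. A minimal way to do that is sketched below; the file name is hypothetical and the demonstration writes to a temporary directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_copy(source: Path) -> Path:
    """Write a gzip-compressed copy of `source` next to it and return its path."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Demonstration with a throwaway file; the name and content are placeholders.
workdir = Path(tempfile.mkdtemp())
example = workdir / "example-export.json"
example.write_bytes(b'{"messages": []}\n')
compressed = gzip_copy(example)
print(compressed.name)
```

The original file is left in place, so the uncompressed example stays readable alongside the compressed copy.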
Example Data:
See more examples in the docs/sources/salesforce/example-api-responses
folder of the Psoxy repository.
Before running the example, you have to populate the following variables in terraform:
salesforce_domain
. This is the domain your instance is using.
salesforce_example_account_id
: An example of any account id; this is only applicable for example calls.
Create a Salesforce application with the following permissions:
Manage user data via APIs (api
)
Access Connect REST API resources (chatter_api
)
Perform requests at any time (refresh_token
, offline_access
)
Access unique user identifiers (openid
)
Access Lightning applications (lightning
)
Access content resources (content
)
Perform ANSI SQL queries on Customer Data Platform data (cdp_query_api
)
Apart from Salesforce instructions above, please review the following:
"Callback URL" MUST be filled; can be anything as not required in this flow, but required to be set by Salesforce.
Application MUST be marked with "Enable Client Credentials Flow"
You MUST assign a user for Client Credentials, be sure:
you associate a "run as" user marked with "API Only Permission"
The policy associated to the user MUST have the following Administrative Permissions enabled:
API Enabled
APEX REST Services
The policy MUST have the application marked as "enabled" in "Connected App Access". Otherwise requests will return 401 with INVALID_SESSION_ID
The user set for "run as" on the connector should have, between its Permission Sets
and Profile
, the permission of View All Data
. This is required to support the queries used to retrieve by account id.
Once created, open "Manage Consumer Details"
Update the content of PSOXY_SALESFORCE_CLIENT_ID
from Consumer Key and PSOXY_SALESFORCE_CLIENT_SECRET
from Consumer Secret
Finally, we recommend running the test-salesforce
script with all the queries in the example to ensure the expected information covered by rules can be obtained from the Salesforce API. Some test calls may fail with a 400 (bad request) response. That is expected if parameters requested in the query are not available (for example, running a SOQL query with fields that are NOT present in your model will force a 400 response from the Salesforce API). If that is the case, check the function logs to confirm that this is the actual error; you should see an error like the following one: WARNING: Source API Error [{ "message": "\nLastModifiedById,NumberOfEmployees,OwnerId,Ownership,ParentId,Rating,Sic,Type\n ^\nERROR at Row:1:Column:136\nNo such column 'Ownership' on entity 'Account'. If you are attempting to use a custom field, be sure to append the '__c' after the custom field name. Please reference your WSDL or the describe call for the appropriate names.", "errorCode": "INVALID_FIELD" }]
In that case, removing the fields LastModifiedById,NumberOfEmployees,OwnerId,Ownership,ParentId,Rating,Sic,Type from the query will fix the issue.
However, running any of the queries may instead return a 401/403/500/512. A 401/403 might be related to a misconfiguration in the Salesforce Application due to lack of permissions; a 500/512 could be related to a missing parameter in the function configuration (for example, a missing value for the salesforce_domain
variable in your terraform vars). NOTE: derived from worklytics-connector-specs; refer to that for definitive information.
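When triaging a 400 response like the one quoted above, it can help to extract the offending field from the logged error rather than read it by eye. The sketch below does that and drops the field from a simple SOQL query. The helper names are hypothetical, the regular expression is based only on the error text shown in this section, and the query rewriting assumes a flat "SELECT a,b,c FROM Entity" query with no subqueries.

```python
import re

NO_SUCH_COLUMN = re.compile(r"No such column '(?P<column>\w+)' on entity '(?P<entity>\w+)'")


def missing_column(source_api_errors):
    """Return (column, entity) from the first INVALID_FIELD error, else None."""
    for error in source_api_errors:
        if error.get("errorCode") == "INVALID_FIELD":
            match = NO_SUCH_COLUMN.search(error.get("message", ""))
            if match:
                return match.group("column"), match.group("entity")
    return None


def drop_field(soql, field):
    """Remove one field from the SELECT list of a flat SOQL query."""
    select_part, from_part = re.split(r"\s+FROM\s+", soql, maxsplit=1, flags=re.IGNORECASE)
    fields = [name.strip() for name in select_part[len("SELECT"):].split(",")]
    kept = [name for name in fields if name != field]
    return f"SELECT {','.join(kept)} FROM {from_part}"


errors = [{
    "message": "\nLastModifiedById,NumberOfEmployees,OwnerId,Ownership,ParentId,Rating,Sic,Type\n"
               " ^\nERROR at Row:1:Column:136\nNo such column 'Ownership' on entity 'Account'.",
    "errorCode": "INVALID_FIELD",
}]
column, entity = missing_column(errors)
fixed_query = drop_field("SELECT Id,Name,Ownership FROM Account", column)
print(column, entity, fixed_query)
```

Salesforce reports one missing column per response, so a query with several unavailable fields may need the check repeated.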
Connect Outlook Mail data to Worklytics, enabling communication analysis and general collaboration insights based on collaboration via Outlook Mail. Includes user enumeration to support fetching mailboxes from each account; and group enumeration to expand emails via mailing list (groups).
Please review the Microsoft 365 README for general information applicable to all Microsoft 365 connectors.
See the Microsoft 365 Authentication section of the main README.
See the Microsoft 365 Authorization section of the main README.
Assuming proxy is auth'd as an application, you'll have to replace me
with your MSFT ID or UserPrincipalName
(often your email address).
beta As an alternative to connecting Worklytics to the Slack Discovery API via the proxy, it is possible to use the bulk-mode of the proxy to sanitize an export of Slack Discovery data and ingest the resulting sanitized data to Worklytics. Example data of this is given in the folder.
This data can be processed using custom multi-file type rules in the proxy, of which is an example.
See more examples in the docs/sources/microsoft-365/msft-teams/example-api-responses
folder of the Psoxy repository.
/v1.0/me/mailFolders/SentItems/messages
/v1.0/me/messages/{messageId}
/v1.0/me/mailboxSettings
Yes, Psoxy supports filtering bulk (flat) files or API responses to remove PII or other sensitive data prior to transfer to a 3rd party. You configure it with a static set of rules, providing customizable sanitization behavior of fields. Psoxy supports complex JsonPath expressions if needed, to perform sanitization generally across many fields and endpoints.
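As a simplified picture of rule-driven sanitization, the sketch below walks a JSON-like response, drops some fields, and replaces others with a salted hash. It matches fields by key name only, which is far cruder than the JsonPath-based rules Psoxy actually uses; the function, the field names, and the hashing scheme are assumptions for illustration.

```python
import hashlib


def sanitize(node, pseudonymize_keys, redact_keys, salt):
    """Recursively redact or pseudonymize fields of a parsed JSON document."""
    if isinstance(node, dict):
        sanitized = {}
        for key, value in node.items():
            if key in redact_keys:
                continue  # redacted fields are removed entirely
            if key in pseudonymize_keys and isinstance(value, str):
                digest = hashlib.sha256((salt + value.lower()).encode("utf-8"))
                sanitized[key] = digest.hexdigest()
            else:
                sanitized[key] = sanitize(value, pseudonymize_keys, redact_keys, salt)
        return sanitized
    if isinstance(node, list):
        return [sanitize(item, pseudonymize_keys, redact_keys, salt) for item in node]
    return node  # scalars pass through unchanged


# Made-up response resembling a calendar event listing.
response = {"value": [{
    "id": "1",
    "subject": "Quarterly plan",
    "organizer": {"email": "Alice@example.com"},
    "start": "2023-08-01T10:00:00Z",
}]}
clean = sanitize(response, pseudonymize_keys={"email"}, redact_keys={"subject"},
                 salt="example-salt")
print(clean)
```

The input document is not modified; a sanitized copy is returned, which mirrors the stateless, pass-through role of the proxy.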
Yes, but only to a broad set of IP blocks that are not exclusive to your Worklytics tenant. As requests from your Worklytics tenant to your Psoxy instances are authenticated via identity federation (OIDC) and authorized by your cloud provider's IAM policies, IP-based restrictions are not necessary.
If you take this approach, you will be responsible for updating your IP restrictions frequently as GCP changes their IP blocks, or your data flow to Worklytics may break. As such, this is not officially supported by Worklytics. For an example of how to do this, see worklytics-ip-blocks module.
Your Worklytics tenant is a process running in GCP, personified by a unique GCP service account. You simply use your cloud's IAM to grant that service account access to your psoxy instance.
This is functionally equivalent to how access is authenticated and authorized to within and between any public cloud infrastructure. Eg, access to your S3 buckets is authorized via a policy you specify in AWS IAM.
Remember that Psoxy is, in effect, a drop-in replacement for a data source's API; in general, these APIs, such as for Google Workspace, Slack, Zoom, and Microsoft 365, are already accessible from anywhere on the internet without IP restriction. Psoxy exposes only a more restricted view of the source API - a subset of its endpoints, HTTP methods (read-only), and fields - with field values that contain PII redacted or pseudonymized.
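The "restricted view" idea can be pictured as an allow-list check that runs before any request is forwarded. The sketch below is a toy version under assumed rules: the two endpoint patterns and the function name are illustrative, not the rule set shipped with any connector.

```python
import re

# Illustrative allow-list: (HTTP method, path pattern) pairs.
ALLOWED_ENDPOINTS = [
    ("GET", re.compile(r"/v1\.0/users/[^/]+/events")),
    ("GET", re.compile(r"/v1\.0/users/[^/]+/mailboxSettings")),
]


def is_allowed(method, path):
    """Return True only if the method and full path match an allowed endpoint."""
    return any(
        method.upper() == allowed_method and pattern.fullmatch(path) is not None
        for allowed_method, pattern in ALLOWED_ENDPOINTS
    )


print(is_allowed("GET", "/v1.0/users/abc/events"))        # allowed by the first rule
print(is_allowed("POST", "/v1.0/users/abc/events"))       # write methods are rejected
print(is_allowed("GET", "/v1.0/users/abc/messages/1"))    # endpoint not on the list
```

Using fullmatch rather than a prefix match means a permitted endpoint does not implicitly permit its sub-resources.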
See AWS Authentication and Authorization for more details.
See GCP Authentication and Authorization for more details.
And always remember: an IP is not an authenticated identity for a client, and should not be relied upon as an authentication mechanism. IPs can be spoofed. It is at best an extra control.
Yes - and prior to March 2022 this was necessary. But AWS has released Lambda function URLs, which provide a simpler and more direct way to securely invoke lambdas via HTTP. As such, the Worklytics-provided Terraform modules use function URLs rather than API gateways.
API gateways provide a layer of indirection that can be useful in certain cases, but are overkill for psoxy deployments - which do little more than provide a transformed, read-only view of a subset of endpoints within a data source API. The indirection provides flexibility and control, but at the cost of complexity in infrastructure and management - as you must provision a gateway, route, stage, and extra IAM policies to make that all work, compared to a function URL.
That said, the payload lambdas receive when invoked via a function URL is equivalent to the payload of API Gateway v2, so the proxy itself is compatible with either API Gateway v2 or function urls.
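To make the shared payload shape concrete, here is a minimal sketch of reading the request method and path from an event in the version 2.0 payload format, which both function URLs and API Gateway v2 (HTTP APIs) deliver. The helper names and the sample event are assumptions for illustration; this is not the proxy's actual handler code.

```python
def describe_request(event):
    """Extract method, path, and query string from a payload-format-2.0 event."""
    http = event["requestContext"]["http"]
    return {
        "method": http["method"],
        "path": event.get("rawPath", "/"),
        "query": event.get("rawQueryString", ""),
    }


def is_read_only(event):
    """True for methods that only read data (hypothetical check, for illustration)."""
    return describe_request(event)["method"] in ("GET", "HEAD")


# Trimmed-down sample event; real events carry many more fields.
sample_event = {
    "version": "2.0",
    "rawPath": "/v2/users",
    "rawQueryString": "page_size=10",
    "headers": {"accept": "application/json"},
    "requestContext": {"http": {"method": "GET", "path": "/v2/users"}},
}
summary = describe_request(sample_event)
```

Because the same fields are present regardless of which front end invoked the lambda, code written this way does not need to know whether a gateway sits in front of it.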
See API Gateway for more details on how to use Worklytics-provided terraform modules to enable API gateway in front of your proxy instances.
Sure, but why? Psoxy is itself a rules-based layer that validates requests, authorizes them, and then sanitizes the response. It is a drop-in replacement for the API of your data source, which in many cases is publicly exposed to the internet and likely implements its own WAF.
Psoxy never exposes more data than is in the source API itself, and in the usual case it provides read-only access to a small subset of API endpoints and fields within those endpoints.
Psoxy is stateless, so all requests must go to the source API. Psoxy does not cache or store any data. There is no database to be vulnerable to SQL injections.
A WAF could make sense if you are using Psoxy to expose an on-prem, in-house built tool to Worklytics that is otherwise not exposed to the internet.
VPC support is available as a beta feature as of February 2024.
VPC usage requires an API Gateway to be deployed in front of the proxy instances.
Please note that proxy instances generally use the public APIs of cloud SaaS tools, so do not require access to your internal network/VPN unless you are connecting to an on-prem tool (eg, GitHub Enterprise Server, etc). So there is no technical reason to deploy Psoxy instances in a VPC.
As such, only organizations with inflexible policies requiring such infra to be in a VPC should add this complexity. Security is best achieved by simplicity and transparency, so deploying VPC and API Gateway for its own sake does not improve security.
see: VPC Support
DWD deserves scrutiny. It is a broad grant of data access, generally covering all Google accounts in your workspace domain. And the UX - pasting a numeric service account ID and a CSV of OAuth scopes - creates potential for errors/exploitation by malicious actors.
To use DWD securely, you must trust the numeric ID; in a typical scenario, where someone or some web app is asking you to paste this ID into a form, this is a risk. It is NOT a 3-legged oauth flow, where the redirects between
However, the Psoxy workflow mitigates this risk in several ways:
DWD grants required for Psoxy connections are made to your own service accounts, provisioned by you and residing in your own GCP project. They do not belong to a 3rd party. As such you need not trust a number shown to you in a web app or email; you can use the GCP web console, CLI, etc to confirm the validity of the service account ID independently.
Your GCP logs can provide transparency into the usage of the service account, to validate what data it is being used to access, and from where.
You remain in control of the only key that can be used to authenticate as the service account - you may revoke/rotate this key at any moment should you suspect malicious activity.
Hence, using DWD via Psoxy is more secure than the typical DWD scenario that many security researchers complain about.
If you remain uncomfortable with DWD, a private Google Marketplace App is a possible alternative, albeit more tedious to configure. It requires a dedicated GCP project, with additional APIs enabled in the project.
No. ABAC means specifying an access control policy predicated on attributes of the object/resource being accessed. The approach of Psoxy is better described as Attribute-level Access Control, where the access control policy can be written to limit access to specific attributes (fields) within an object/resource.
Eg, evaluation of an ABAC policy still results in a boolean allow/deny decision on the request; Psoxy policy (rule) evaluation results in a modified response, with specific fields redacted or transformed in accordance with the policy.
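The contrast can be sketched in a few lines. Both functions below are hypothetical illustrations (the attribute names and the policy are invented for this example), not code from Psoxy or from any ABAC engine.

```python
def abac_decision(request, resource):
    """ABAC-style: attributes go in, a single allow/deny decision comes out."""
    return request["department"] == resource["department"]


def attribute_level_view(resource, allowed_fields):
    """Attribute-level: the request is answered, but only with permitted fields."""
    return {name: value for name, value in resource.items() if name in allowed_fields}


resource = {"id": 42, "department": "sales", "email": "bob@example.com"}

allowed = abac_decision({"department": "sales"}, resource)
view = attribute_level_view(resource, allowed_fields={"id", "department"})
```

The first call yields only True or False for the whole resource; the second always yields a response, with the email field stripped out.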
Example commands (*) that you can use to validate proxy behavior against the Zoom APIs. Follow the steps and change the values to match your configuration when needed.
For AWS, change the role to assume to one with sufficient permissions to call the proxy (-r flag). Example:
If any call appears to fail, repeat it using the -v flag.
(*) All commands assume that you are at the root path of the Psoxy project.
Now pull out a user id ([zoom_user_id], accessor path in response .users[0].id). Next call is bound to a single user:
First pull out a meeting id ([zoom_meeting_id], accessor path in response .meetings[0].id):
Example commands (*) that you can use to validate proxy behavior against the Slack Discovery APIs. Follow the steps and change the values to match your configuration when needed.
For AWS, change the role to assume to one with sufficient permissions to call the proxy (-r flag). Example:
If any call appears to fail, repeat it using the -v flag.
(*) All commands assume that you are at the root path of the Psoxy project.
Get a workspace ID (accessor path in response .enterprise.teams[0].id):
Get conversation details of that workspace (replace workspace_id with the corresponding value):
Get a channel ID (accessor path in response .channels[0].id):
Get DM information (no workspace):
Read messages for workspace channel:
Omit the workspace ID if channel is a DM
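The accessor paths mentioned in these steps (eg, .enterprise.teams[0].id) can be applied to a saved response with a small helper. The sketch below is an illustration only: the follow function is hypothetical, it handles just plain keys and numeric indexes, and the sample JSON is invented rather than a real API response.

```python
import json
import re


def follow(document, accessor):
    """Resolve a jq-like accessor such as '.enterprise.teams[0].id'."""
    node = document
    # Each match is either a '.name' step or an '[index]' step.
    for name, index in re.findall(r"\.(\w+)|\[(\d+)\]", accessor):
        node = node[name] if name else node[int(index)]
    return node


# Invented sample standing in for a response saved from the proxy.
raw = '{"enterprise": {"teams": [{"id": "T123", "name": "example"}]}}'
workspace_id = follow(json.loads(raw), ".enterprise.teams[0].id")
```

The resulting value is what you would substitute for workspace_id in the following calls.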
As of July 2023, pulling historical data (last 6 months) and all scheduled and instant meetings requires a Zoom paid account on Pro or higher plan (Business, Business Plus). On other plans Zoom data may be incomplete.
Accounts on unpaid plans do not have access to some methods Worklytics uses, such as:
-required for historical data
certain methods such as retrieving
Example Data : |
See more examples in the docs/sources/zoom/example-api-responses folder of the Psoxy repository.
The Zoom connector through Psoxy requires a Custom Managed App on the Zoom Marketplace. This app may be left in development mode; it does not need to be published.
Go to https://marketplace.zoom.us/develop/create and create an app of type "Server to Server OAuth".
After creation, it will show the App Credentials.
Copy the following values:
Account ID
Client ID
Client Secret
Share them with the AWS/GCP administrator, who should store them in your host platform's secret manager (AWS Systems Manager Parameter Store / GCP Secret Manager) for use by the proxy when authenticating with the Zoom API:
Account ID --> PSOXY_ZOOM_ACCOUNT_ID
Client ID --> PSOXY_ZOOM_CLIENT_ID
Client Secret --> PSOXY_ZOOM_CLIENT_SECRET
NOTE: Anytime the Client Secret is regenerated, it needs to be updated in the proxy too.
NOTE: The Client Secret should be handled according to your organization's security policies for API keys/secrets, as in combination with the Account ID and Client ID it allows access to your organization's data.
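To show how these three values fit together, the sketch below builds (but does not send) an access token request, assuming Zoom's documented Server-to-Server OAuth flow: a POST to https://zoom.us/oauth/token with grant_type=account_credentials, the account ID as a query parameter, and the client ID and secret as HTTP Basic credentials. The function name and the placeholder credentials are assumptions; the proxy's own token handling may differ.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(account_id, client_id, client_secret):
    """Build the token request; sending it is left to the caller."""
    query = urllib.parse.urlencode(
        {"grant_type": "account_credentials", "account_id": account_id}
    )
    # Client ID and secret travel as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"https://zoom.us/oauth/token?{query}",
        method="POST",
        headers={"Authorization": f"Basic {basic}"},
    )


# Placeholder values; real ones come from your secret manager, never source code.
request = build_token_request("example-account", "example-client", "example-secret")
```

Passing the request to urllib.request.urlopen would perform the exchange; that step is omitted here so the sketch runs without network access or real credentials.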
Fill the 'Information' section. Zoom requires company name, developer name, and developer email to activate the app.
No changes are needed in the 'Features' section. Continue.
Fill the scopes section by clicking on + Add Scopes and adding the following:
meeting:read:past_meeting:admin
meeting:read:meeting:admin
meeting:read:list_past_participants:admin
meeting:read:list_past_instances:admin
meeting:read:list_meetings:admin
meeting:read:participant:admin
report:read:list_meeting_participants:admin
report:read:meeting:admin
report:read:user:admin
user:read:user:admin
user:read:list_users:admin
Alternatively, the scopes user:read:admin, meeting:read:admin, and report:read:admin are sufficient, but as of May 2024 they are no longer available for newly created Zoom apps.
Once the scopes are added, click on Done and then Continue.
Activate the app
JSON Filter is inspired by JSON Schema, but with the goal to filter documents rather than validate. As such, the basic idea is that data nodes that do not match the filter schema are removed, rather than the whole document failing validation.
The goal of JsonFilter is that only data elements specified in the filter pass through.
Some differences:
required properties are ignored. While in JSON Schema an object that is missing a "required" property is invalid, objects missing "required" properties in a filter will be preserved.
{ }, eg, a schema without a type, is interpreted as any valid leaf node (eg, unconstrained leaf; everything that's not 'array' or 'object') - rather than any valid JSON.
Compatibility goals:
a valid JSON Schema is convertible to valid JSON filter (with JSON schema features not supported by JSON filter ignored)
{ } as "any valid leaf node":
compactness, esp when encoding the filter as YAML; you can put { } instead of { "type": "string" }
flexibility; for the filtering use-case, often you just care about which properties are/aren't passed, rather than 'string' vs 'number' vs 'integer'
"any valid JSON" is a more common use case in validation than in filtering.
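The filtering semantics described above can be sketched as a small recursive function. This is an illustrative reading of those rules, not the JsonFilter implementation used by the proxy; the function name, the DROP sentinel, and the sample filter and document are assumptions for this example.

```python
DROP = object()  # sentinel meaning "this node does not pass the filter"

LEAF_TYPES = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
}


def apply_filter(node, schema):
    """Return the filtered node, or DROP if it does not match the filter schema."""
    node_type = schema.get("type")
    if node_type == "object":
        if not isinstance(node, dict):
            return DROP
        kept = {}
        # Only listed properties pass; "required" is deliberately ignored.
        for name, sub_schema in schema.get("properties", {}).items():
            if name in node:
                value = apply_filter(node[name], sub_schema)
                if value is not DROP:
                    kept[name] = value
        return kept
    if node_type == "array":
        if not isinstance(node, list):
            return DROP
        items = (apply_filter(item, schema.get("items", {})) for item in node)
        return [item for item in items if item is not DROP]
    if isinstance(node, (dict, list)):
        return DROP  # leaf schemas, including { }, never match containers
    if node_type is None:
        return node  # { } accepts any leaf
    if isinstance(node, bool) and node_type != "boolean":
        return DROP  # in Python, bool is a subclass of int; keep them apart
    return node if isinstance(node, LEAF_TYPES.get(node_type, ())) else DROP


filter_schema = {
    "type": "object",
    "required": ["id", "email"],
    "properties": {
        "id": {},
        "members": {
            "type": "array",
            "items": {"type": "object", "properties": {"role": {"type": "string"}}},
        },
    },
}
document = {"id": 7, "name": "secret", "members": [{"role": "owner", "email": "a@example.com"}]}
filtered = apply_filter(document, filter_schema)
```

In this example the unlisted name and email fields are removed, while the document as a whole is preserved even though the "required" email property is absent at the top level.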
Connect to Directory data in Microsoft 365. This allows enumeration of all users, groups, and group members in your organization, to provide additional segmentation, timezone/workday information, etc.
Please review the Microsoft 365 README for general information applicable to all Microsoft 365 connectors.
See the Microsoft 365 Authentication section of the main README.
See the Microsoft 365 Authorization section of the main README.
/v1.0/groups/{group-id}/members
/v1.0/users
/v1.0/users/me
/v1.0/groups
Assuming the proxy is auth'd as an application, you'll have to replace me with your MSFT ID or UserPrincipalName (often your email address).
See more examples in the docs/sources/microsoft-365/entra-id/example-api-responses folder of the Psoxy repository.