Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
As of June 2023, the following resources provisioned by Psoxy modules support use of CMEKs:
Lambda function environment variables
SSM Parameters
Cloud Watch Log Groups
S3 Buckets
The psoxy-example-aws
example provides a project_aws_key_arn
variable, that, if provided, will be set as the encryption key for these resources. A few caveats:
The AWS principal your Terraform is running as must have permissions to encrypt/decrypt with the key (it needs to be able to read/write the lambda env, ssm params, etc)
The key should be in the same AWS region you're deploying to.
CloudWatch must be able to use the key, as described in AWS CloudWatch docs
In example-dev/aws-all/kms-cmek.tf
, we provide a bunch of lines that you can uncomment to use encryption on S3 and properly set key policy to support S3/CloudWatch use.
For production use, you should adapt the key policy to your environment and scope as needed to follow your security policies, such as principle of least privilege.
If you need more granular control of CMEK by resource type, review the main.tf
and variables exposed by the aws-host
module for some options.
YMMV; as of June 2023, AWS's 1GB limit on cloud shell persistent storage is too low for real world proxy deployments, which typically require install gcloud CLI / Azure CLI to connect to sources
So use use your local machine, or a VM/container elsewhere in AWS (EC2, AWS Cloud9, etc
clone the repo
add the following lines to your ~/.bashrc
. (AWS Cloud Shell preserves only your HOME directory across sessions, so add any commands that modify/install things outside to your .bashrc
)
Then source ~/.bashrc
, to execute the above.
install Terraform
if using Google Workspace data sources, install Google Cloud CLI and authenticate.
if using Microsoft 365 data sources, install Azure CLI and authenticate.
https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
You should now be ready for the general instructions in the README.md.
If default NodeJS tooling doesn't work for you, legacy testing tools use python/awscurl, installed via pip. See example below:
By default, Psoxy uses AWS Systems Manager Parameter Store to store secrets; this simplifies configuration and minimizes costs. However, you may want to use AWS Secrets Manager to store secrets due to organization policy.
In such a case, you can add the following to your terraform.tfvars
file:
This will alter the behavior of the Terraform modules to store everything considered a secret to be stored/loaded from AWS Secrets Manager instead of AWS Systems Manager Parameter Store. Note that Parameter Store is still used for non-secret configuration information, such as proxy rules, etc.
Changes will also be made to AWS IAM Policies, to allow lambda function execution roles to access Secrets Manager as needed.
If any secrets are managed outside of Terraform (such as API keys for certain connectors), you will need to grant access to relevant secrets in Secrets Manager to the principals that will manage these.
Required:
Optional:
AWS SAM CLI (macOS) for local testing, if desired
awscurl for direct testing of deployed AWS lambda from a terminal
Maven build produces a zip file.
Build core library
From java/impl/aws/
:
Locally, you can test function's behavior from invocation on a JSON payload (but not how the API gateway will map HTTP requests to that JSON payload):
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html
We recommend deploying your Psoxy code into AWS using the terraform modules found in [infra/modules/
](../../infra/modules/] for AWS. These modules both provision the required AWS infrastructure, as well as deploying the built binaries for Psoxy as lambdas in the target account.
Example configurations using those modules can be found in `infra/examples/.
You'll ultimately provision infrastructure represented in green in the following diagram:
![AWS data flow](./2022-02 Psoxy Data Flow.png)
See infra/modules/aws/
for more information.
beta - This is now available for customer-use, but may still change in backwards incompatible ways.
Our aws-host
module provides a vpc_config
variable to specify the VPC configuration for the lambdas that our Terraform modules will create, analogous to the vpc_config
block supported by the AWS lambda terraform resource.
Some caveats:
API connectors on a VPC must be exposed via API Gateway rather than Function URLs (our Terraform modules will make this change for you).
VPC must be configured such that your lambda has connectivity to AWS services including S3, SSM, and CloudWatch Logs; this is typically done by adding a VPC Endpoint for each service.
VPC must allow any API connector to connect to data source APIs via HTTPS (eg 443); usually these APIs are on the public internet, so this means egress to public internet.
VPC must allow your API gateway to connect to your lambdas.
The requirements above MAY require you to modify your VPC configuration, and/or the security groups to support proxy deployment. The example we provide in our vpc.tf
should fulfill this if you adapt it; or you can use it as a reference to adapt you existing VPC.
To put the lambdas created by our terraform example under a VPC, please follow one of the approaches documented in the next sections.
If you have an existing VPC, you can use it with the vpc_config
variable by hard coding the ids of the pre-existing resources (provisioned outside the scope of your proxy's terraform configuration).
vpc.tf
If you don't have a pre-existing VPC, you wish to use, our aws example repo includes vpc.tf
file at the top-level. This file has a bunch of commented-out terraform resource blocks that can serve as examples for creating the minimal VPC + associated infra. Review and uncomment to meet your use-case.
Prerequisites:
the AWS principal (user or role) you're using to run Terraform must have permissions to manage VPCs, subnets, and security groups. The AWS managed policy AmazonVPCFullAccess
provides this.
all pre-requisites for the api-gateways (see api-gateway.md)
NOTE: if you provide vpc_config
, the value you pass for use_api_gateway_v2
will be ignored; using a VPC requires API Gateway v2, so will override value of this flag to true
.
Add the following to "psoxy" module in your main.tf
(or uncomment if already present):
Uncomment the relevant lines in vpc.tf
in the same directory, and modify as you wish. This file pulls the default VPC/subnet/security group for your AWS account under terraform.
Alternatively, you modify vpc.tf
to use a provision non-default VPC/subnet/security group, and reference those from your main.tf
- subject to the caveats above.
See the following terraform resources that you'll likely need:
Check your Cloud Watch logs for the lambda. Proxy lambda will time out in INIT phase if SSM Parameter Store or your secret store implementation (AWS Secrets Manager, Vault) is not reachable.
Some potential causes of this:
DNS failure - it's going to look up the SSM service by domain; if the DNS zone for the SSM endpoint you've provisioned is not published on the VPC, this will fail; similarly, if the endpoint wasn't configured on a subnet - then it won't have an IP to be resolved.
if the IP is resolved, you should see failure to connect to it in the logs (timeouts); check that your security groups for lambda/subnet/endpoint allow bidirectional traffic necessary for your lambda to retrieve data from SSM via the REST API.
Terraform with aws provider doesn't seem to play nice with lambdas/subnets; the subnet can't be destroyed w/o destroying the lambda, but terraform seems unaware of this and will just wait forever.
So:
destroy all your lambdas (terraform state list | grep aws_lambda_function
; then terraform destroy --target=
for each, remember '' as needed)
destroy the subnet terraform destroy --target=aws_subnet.main
https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html
https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
Some ideas on how to support scenarios and configuration requirements beyond what our default examples show:
If you're using our AWS example, it should support a default_tags
variable.
You can add the following in your terrform.tfvars
file to set tags on all resources created by the example configuration:
If you're not using our AWS example, you can add the following to your configuration, then you will need to modify the aws
provider block in your configuration to add a default_tags
. Example shown below:
See: [https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags]
To support extensibility, our Terraform examples/modules output the IDs/names of the major resources they create, so that you can compose them with other Terraform resources.
The aws-host
module outputs bulk_connector_instances
; a map of id => instance
for each bulk connector. Each of these has two attributes that correspond to the names of its related buckets:
sanitized_bucket_name
input_bucket_name
So in our AWS example, you can use these to enable logging, for example, you could do something like this: (YMMV, syntax etc should be tested)
Analogous approaches can be used to configure versioning, replication, etc;
Note that encryption, lifecycle, public_access_block are set by the Workltyics-provided modules, so you may have conflicts issues if you also try to set those outside.
beta - released from v0.4.50; YMMV, and may be subject to change.
The terraform modules we provide provision execution roles for each lambda function, and attach by default attach the appropriate AWS Managed Policy to each.
Specifically, this is AWSLambdaBasicExecutionRole
, unless you're using a VPC - in which case it is AWSLambdaVPCAccessExecutionRole
(https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaVPCAccessExecutionRole.html).
For organizations that don't allow use of AWS Managed Policies, you can use the aws_lambda_execution_role_policy_arn
variable to pass in an alternative which will be used INSTEAD of the AWS Managed Policy.
You'll provision the following to host Psoxy in AWS:
Lambda Functions
IAM Roles and Policies
System Manager Parameter Store Parameters
Cloud Watch Log Groups
S3 buckets, if using the 'bulk' mode to sanitize file data (such as CSVs)
Cognito Pools and Identities, if connecting to Microsoft 365 (Azure AD) data sources
The diagram below provides an architecture overview of the 'REST' and 'Bulk' use-cases.
An AWS Account in which to deploy Psoxy We strongly recommend you provision one specifically for use to host Psoxy, as this will create an implicit security boundary, reduce possible conflicts with other infra configured in the account, and simplify eventual cleanup.
You will need the numeric AWS Account ID for this account, which you can find in the AWS Console.
If your AWS organization enforces Service Control Policies, ensure that these are allow the AWS components required by Psoxy or exempt the AWS Account in which you will deploy Psoxy from these policies.
If your organization uses any sort of security control enforcement mechanism, you may have disable/provide exceptions to those controls for you initial deployment. Then generally those controls can be implemented later by extending our examples. Our protips page provides some guidance on how to extend the base examples to meet more extreme requirements.
A sufficiently privileged AWS Role You must have a IAM Role within the AWS account with sufficient privileges to (AWS managed policy examples linked):
create IAM roles + policies (eg IAMFullAccess)
create and update Systems Manager Parameters (eg, AmazonSSMFullAccess )
create and manage Lambdas (eg AWSLambda_FullAccess )
create and manage S3 buckets (eg AmazonS3FullAccess )
create Cloud Watch Log groups (eg CloudWatchFullAccess)
(Yes, the use of AWS Managed Policies results in a role with many privileges; that's why we recommend you use a dedicated AWS account to host proxy which is NOT shared with any other use case)
You will need the ARN of this role.
NOTE: if you're connecting to Microsoft 365 (Azure AD) data sources, you'll also need permissions to create AWS Cognito Identity Pools and add Identities to them, such as arn:aws:iam::aws:policy/AmazonCognitoPowerUser. Some AWS Organizations have Service Control Policies in place that deny this by default, even if you have an IAM role that allows it at an account level.
An authenticated AWS CLI in your provisioning environment. Your environment (eg, shell/etc from which you'll run terraform commands) must be authenticated as an identity that can assume that role. (see next section for tips on options for various environments you can use)
Eg, if your Role is arn:aws:iam::123456789012:role/PsoxyProvisioningRole
, the following should work:
To provision AWS infra, you'll need the aws-cli
installed and authenticated on the environment where you'll run terraform
.
Here are a few options:
Generate an AWS Access Key for your AWS User.
Run aws configure
in a terminal on the machine you plan to use, and configure it with the key you generated in step one.
NOTE: this could even be a GCP Cloud Shell, which may simplify auth if your wish to connect your Psoxy instance to Google Workspace as a data source.
If your organization prefers NOT to authorize the AWS CLI on individual laptops and/or outside AWS, provisioning Psoxy's required infra from an EC2 instance may be an option.
provision an EC2 instance (or request that your IT/dev ops team provision one for you). We recommend a micro instance with an 8GB disk, running ubuntu
(not Amazon Linux; if you choose that or something else, you may need to adapt these instructions). Be sure to create a PEM key to access it via SSH (unless your AWS Organization/account provides some other ssh solution).
associate the Role above with your instance (see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html)
Whichever environment you choose, follow general prereq installation, and, when ready, continue with README.
You'll also need a backend location for your Terraform state (such as an S3 bucket). It can be in any AWS account, as long as the AWS role that you'll use to run Terraform has read/write access to it.
See https://developer.hashicorp.com/terraform/language/settings/backends/s3 for more details.
Alternatively, you may use a local file system, but this is not recommended for production use - as your Terraform state may contain secrets such as API keys, depending on the sources you connect.
See https://developer.hashicorp.com/terraform/language/settings/backends/local
The module psoxy-constants is a dependency-free module that provides lists of AWS managed policies, etc needed for bootstraping a AWS account in which your proxy instances will reside.
Once you've fulfilled the prereqs, including having your terraform deployment environment, backend, and AWS account prepared, we suggest you use our AWS example template repo:
https://github.com/Worklytics/psoxy-example-aws
Follow the 'Usage' instructions there to continue.
If you use Psoxy to send pseudonymized data to Worklytics and later wish to re-identify the data that you export from Worklytics to your premises, you'll need a lookup table in your data warehouse to JOIN with that data.
Populating this variable will generate another version of your HRIS data (aside from the one exposed to Worklytics) which you can then import back to your data warehouse.
To enable it, add the following to your terraform.tfvars
file:
In sanitized_accessor_role_names
, add the name of whatever AWS role that the principal running ingestion of your lookup table from S3 to your data warehouse will assume. You can add additional role names as needed. Alternatively, you can use an IAM policy created outside of our Terraform module to grant access to the lookup table CSVs within the S3 bucket.
After you apply this configuration, the lookup table will be generated in an S3 bucket. The S3 bucket will be shown in the Terraform output:
Use the bucket name shown in your output to build import pipeline to your data warehouse.
Every time a new hris snapshot is uploaded to the hris -input
bucket, TWO copies of it will be created: a sanitized copy in the bucket accessible Worklytics, and the lookup variant in the lookup bucket referenced above (not accessible to Worklytics).
The lookup table CSV file will have the following columns: EMPLOYEE_ID,EMPLOYEE_ID_ORIG
If you load this into your Data Warehouse, you can JOIN it with the data you export from Worklytics.
Then the following query will give re-identified aggregate data:
The employeeId
column in the result set will be the original employee ID from your HRIS system.
If your HRIS employee ID column is considered PII, then the lookup table and any re-identified data exports you use it to produce should be handled as Personal data, according to your policies, as these now reference readily identifiable Natural Persons.
If you wish limit re-identification to a subset of your data, you can use additional columns present in your HRIS csv to do so, for example:
Within the lookup_table_builders
map, you can specify the following fields:
input_connector_id
- usually hris
; this corresponds the whatever bulk connector you want to build the lookup table for.
rules
- this follows the rules structure for the bulk connector case. The example above is suited for HRIS data following the schema expected by Worklytics. If you modify this, be sure to review our documentation or contact support to ensure you don't break your lookup table.
beta - we're not committed that maintaining this under versioning policy; minor proxy iterations may require changes to privileges required in the least-privileged role.
This is a guide about how to create a role for provisioning psoxy infrastructure in AWS, following the principle of least-privilege at permission-level, rather than policy-level.
Eg, as of v0.4.55 of the proxy, our docs provide guidance on using an AWS role to provision your psoxy infrastructure using the least-privileged set of AWS managed policies possible. A stronger standard would be to use a custom IAM policy rather than AWS managed policy, with the least-privileged set of permissions required.
Additionally, you can specify resource constraints to improve security within a shared AWS account. (However, we do not recommend or officially support deployment into a shared AWS account. We recommend deploying your proxy instances in isolated AWS account to provide an implicit security boundary by default, as an additional layer of protection beyond those provided by our proxy modules)
We provide an example IAM policy document in our psoxy-constants
module that you can use to create a IAM policy in AWS. You can do this outside terraform, finding the JSON from that policy OR via terraform as follows:
This page provides an overview of how proxy authenticates and confirms authorization of clients (Worklytics tenants).
Each Worklytics tenant operates as a unique GCP service account within Google Cloud. GCP issues an identity token for this service account to processes running in the tenant, which the tenant then uses to authenticate against AWS.
No secrets or keys need to be exchanged between Worklytics and your AWS instance. The integrity of the authentication is provided by the signature of the identity token provided by GCP, which AWS verifies against Google's public certificates.
Annotating the diagram for the above case, with specific components for Worklytics-->Proxy case:
In the above, the AWS resource you're allowing access to is AWS IAM role, which your Worklytics tenant assumes and then can access S3 or invoke Lambda function.
Within your AWS account, you create an IAM role, with a role assumption policy that allows your Worklytics tenant's GCP Service Account (identified by a numeric ID you obtain from the Worklytics portal) to assume the role.
This assumption policy will have a statement similar to the following, where the value of the aud
claim is the numeric ID of your Worklytics tenant's GCP Service Account:
Colloquially, this allows a web identity federated from accounts.google.com
where Google has asserted the claim that aud
== 12345678901234567890123456789
to assume the role.
Then you use this AWS IAM role as the principal in AWS IAM policies you define to authorize to invoke your proxy instances via their function URLs (API connectors) or to read from their sanitized output buckets (bulk data connectors)
See: https://github.com/Worklytics/psoxy/blob/v0.4.40/infra/modules/aws/main.tf#L81-L102
Some organizations require use of API Gateway. This is not the default approach for Psoxy since AWS added support for Lambda Function URLs (March 2022), which are a simpler and more direct way to expose lambdas via HTTPS.
Nonetheless, should you wish to use API Gateway we provide beta support for this. It is needed if you wish to put your Lambda functions on a VPC (See lambdas-on-vpc.md
).
In particular:
IAM policy that allows api gateway methods to be invoked by the proxy caller role is defined once, using wildcards, and exposes GET/HEAD/POST methods for all resources. While methods are further constrained by routes and the proxy rules themselves, this could be another enforcement point at the infrastructure level - at expense of N policies + attachments in your terraform plan instead of 1.
proxy instances exposed as lambda function urls have 55s timeout, but API Gateway seems to support 30s as max - so this may cause timeouts in certain APIs
Prerequisites:
Add the following to your terraform.tfvars
file:
Then terraform apply
should create of API gateway-related resources, including policies/etc; and destroy lambda function urls (if you've previously applied with use_api_gateway=false
, which is the default).
If you wish to use API Gateway V1, you will not be able to use the flag above. Instead, you'll have to do something like the following:
Additionally, you'll need to set a different handler class to be invoked instead of the default (co.workltyics.psoxy.Handler
, should be co.worklytics.psoxy.APIGatewayV1Handler
). This can be done in Terraform or by modifying configuration via AWS Console.
Tips and tricks for using AWS as to host the proxy.
If above doesn't happen seem to work as expected, some ideas in the next section may help.
Options:
find credentials output by your SSO helper (eg, aws-okta
) then fill the AWS CLI env variables yourself:
if your SSO helper fills default AWS credentials file but simply doesn't set the env vars, you may be able to export the profile to AWS_PROFILE
, eg
References: https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7
Options:
Log into AWS web console
navigate to the AWS account that hosts your proxy instance (you may need to assume a role in that account)
then the region in that account in which your proxy instance is deployed. (default us-east-1
)
then search or navigate to the AWS Lambda
s feature, and find the specific one you wish to debug
find the tabs for Monitoring
then within that, Logging
, then click "go to Cloud Watch"
Unless your AWS CLI is auth'd as a user who can review logs, first auth it for such a role.
You can do this with a new profile, or setting env variables as follows:
Then, you can do a series of commands as follows:
Something like the following:
Your Terraform state is inconsistent. Run something like the following, adapted for your connector:
NOTE: you likely need to change outlook-mail
if your error is with a different data source. The \
chars are needed to escape the double-quotes/brackets in your bash command.
Something like the following:
Check:
the SSM parameter exists in the AWS account
the SSM parameter can be decrypted by the lambda's execution role (if it's encrypted with a KMS key)
Setting IS_DEVELOPMENT_MODE
to "true" in the Lambda's Env Vars via the console can enable some additional logging with detailed SSM error messages that will be helpful; but note that some of these errors will be expected in certain configurations.
Our Terraform examples should provide both of the above for you, but worth double-checking.
If those are present, yet the error persists, it's possible that you have some org-level security constraint/policy preventing SSM parameters from being used / read. For example, you have a "default deny" policy set for SSM GET actions/etc. In such a case, you need to add the execute roles for each lambda as exceptions to such policies (find these under AWS --> IAM --> Roles).
Our aws-host
Terraform module, as used in our , provides a variable lookup_table_builders
to control generation of these lookup tables.
If your input file follows the for Worklytics, it will have SNAPSHOT,EMPLOYEE_ID,EMPLOYEE_EMAIL,JOIN_DATE,LEAVE_DATE,MANAGER_ID
columns, at minimum.
Eg, assuming you've exported the to your data warehouse, load the files from S3 bucket above into a table named lookup_hris
.
This is based identity federation (aka "web identity federation" or "workload identity federation").
AWS provides an overview of the specific GCP Case:
the AWS principal (user or role) to provision API gateways. The AWS managed policy provides this.
execute terraform via
execute terraform via
use a script such as to get short-lived key+secret for your user.
the SSM parameter can be read by the lambda's execution rule (eg, has an attached IAM policy that allows the SSM parameter to be read; can test this with the , setting 'Role' to your lambda's execution role, 'Service' to 'AWS Systems Manager', 'Action' to 'Get Parameter' and 'Resource' to the SSM parameter's ARN.