Required:
Optional:
AWS SAM CLI (macOS) for local testing, if desired
awscurl for direct testing of deployed AWS lambda from a terminal
Maven build produces a zip file.
Build core library
From `java/impl/aws/`:
Locally, you can test the function's behavior when it is invoked with a JSON payload (but not how API Gateway will map HTTP requests to that JSON payload):
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-invoke.html
We recommend deploying your Psoxy code into AWS using the Terraform modules found in [infra/modules/](../../infra/modules/) for AWS. These modules both provision the required AWS infrastructure and deploy the built binaries for Psoxy as lambdas in the target account.
Example configurations using those modules can be found in `infra/examples/`.
You'll ultimately provision infrastructure represented in green in the following diagram:
![AWS data flow](./2022-02 Psoxy Data Flow.png)
See `infra/modules/aws/` for more information.
YMMV; as of June 2023, AWS's 1GB limit on Cloud Shell persistent storage is too low for real-world proxy deployments, which typically require installing the gcloud CLI / Azure CLI to connect to sources.
So use your local machine, or a VM/container elsewhere in AWS (EC2, AWS Cloud9, etc).
clone the repo
add the following lines to your `~/.bashrc`. (AWS Cloud Shell preserves only your HOME directory across sessions, so add any commands that modify/install things outside it to your `.bashrc`.)
Then `source ~/.bashrc` to execute the above.
install Terraform
if using Google Workspace data sources, install Google Cloud CLI and authenticate.
if using Microsoft 365 data sources, install Azure CLI and authenticate.
https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
You should now be ready for the general instructions in the README.md.
If default NodeJS tooling doesn't work for you, legacy testing tools use python/awscurl, installed via pip. See example below:
As of June 2023, the following resources provisioned by Psoxy modules support use of CMEKs:
Lambda function environment variables
SSM Parameters
Cloud Watch Log Groups
S3 Buckets
The `psoxy-example-aws` example provides a `project_aws_key_arn` variable that, if provided, will be set as the encryption key for these resources. A few caveats:
The AWS principal your Terraform is running as must have permissions to encrypt/decrypt with the key (it needs to be able to read/write the lambda env, ssm params, etc)
The key should be in the same AWS region you're deploying to.
CloudWatch must be able to use the key, as described in AWS CloudWatch docs
In `example-dev/aws-all/kms-cmek.tf`, we provide a bunch of lines that you can uncomment to use encryption on S3 and properly set key policy to support S3/CloudWatch use.
For production use, you should adapt the key policy to your environment and scope as needed to follow your security policies, such as principle of least privilege.
If you need more granular control of CMEK by resource type, review the `main.tf` and variables exposed by the `aws-host` module for some options.
This page provides an overview of how the proxy authenticates clients (Worklytics tenants) and confirms their authorization.
Each Worklytics tenant operates as a unique GCP service account within Google Cloud. GCP issues an identity token for this service account to processes running in the tenant, which the tenant then uses to authenticate against AWS.
This is OIDC based identity federation (aka "web identity federation" or "workload identity federation").
No secrets or keys need to be exchanged between Worklytics and your AWS instance. The integrity of the authentication is provided by the signature of the identity token provided by GCP, which AWS verifies against Google's public certificates.
AWS provides an overview of the specific GCP Case: Access AWS using a Google Cloud Platform native workload identity
Annotating the diagram for the above case, with specific components for Worklytics-->Proxy case:
In the above, the AWS resource you're allowing access to is AWS IAM role, which your Worklytics tenant assumes and then can access S3 or invoke Lambda function.
Within your AWS account, you create an IAM role, with a role assumption policy that allows your Worklytics tenant's GCP Service Account (identified by a numeric ID you obtain from the Worklytics portal) to assume the role.
This assumption policy will have a statement similar to the following, where the value of the `aud` claim is the numeric ID of your Worklytics tenant's GCP Service Account:
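As a rough illustration of such a statement, the sketch below prints a trust policy document for a given service account ID. It assumes the standard AWS web identity federation form for `accounts.google.com`; the function name `print_trust_policy` is hypothetical, and the policy your Terraform modules actually generate may differ in detail.

```shell
# Print a role assumption (trust) policy that lets a Google-federated identity
# with the given `aud` claim assume the role.
# $1: numeric ID of your Worklytics tenant's GCP Service Account.
# NOTE: hypothetical helper for illustration; compare against your deployed role.
print_trust_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "accounts.google.com" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": { "accounts.google.com:aud": "$1" }
      }
    }
  ]
}
EOF
}

# Example:
# print_trust_policy 12345678901234567890123456789 > trust-policy.json
```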
Colloquially, this allows a web identity federated from `accounts.google.com`, where Google has asserted the claim that `aud` == `12345678901234567890123456789`, to assume the role.
Then you use this AWS IAM role as the principal in the AWS IAM policies you define, authorizing it to invoke your proxy instances via their function URLs (API connectors) or to read from their sanitized output buckets (bulk data connectors).
See: https://github.com/Worklytics/psoxy/blob/v0.4.40/infra/modules/aws/main.tf#L81-L102
You'll provision the following to host Psoxy in AWS:
Lambda Functions
IAM Roles and Policies
System Manager Parameter Store Parameters
Cloud Watch Log Groups
S3 buckets, if using the 'bulk' mode to sanitize file data (such as CSVs)
Cognito Pools and Identities, if connecting to Microsoft 365 (Azure AD) data sources
The diagram below provides an architecture overview of the 'REST' and 'Bulk' use-cases.
An AWS Account in which to deploy Psoxy. We strongly recommend you provision one specifically to host Psoxy, as this will create an implicit security boundary, reduce possible conflicts with other infra configured in the account, and simplify eventual cleanup.
You will need the numeric AWS Account ID for this account, which you can find in the AWS Console.
If your AWS organization enforces Service Control Policies, ensure that these allow the AWS components required by Psoxy, or exempt the AWS Account in which you will deploy Psoxy from these policies.
If your organization uses any sort of security control enforcement mechanism, you may have to disable or provide exceptions to those controls for your initial deployment. Those controls can then generally be implemented later by extending our examples. Our protips page provides some guidance on how to extend the base examples to meet more extreme requirements.
A sufficiently privileged AWS Role. You must have an IAM Role within the AWS account with sufficient privileges to (AWS managed policy examples linked):
create IAM roles + policies (eg IAMFullAccess)
create and update Systems Manager Parameters (eg, AmazonSSMFullAccess )
create and manage Lambdas (eg AWSLambda_FullAccess )
create and manage S3 buckets (eg AmazonS3FullAccess )
create Cloud Watch Log groups (eg CloudWatchFullAccess)
(Yes, the use of AWS Managed Policies results in a role with many privileges; that's why we recommend you use a dedicated AWS account to host proxy which is NOT shared with any other use case)
You will need the ARN of this role.
NOTE: if you're connecting to Microsoft 365 (Azure AD) data sources, you'll also need permissions to create AWS Cognito Identity Pools and add Identities to them, such as arn:aws:iam::aws:policy/AmazonCognitoPowerUser. Some AWS Organizations have Service Control Policies in place that deny this by default, even if you have an IAM role that allows it at an account level.
An authenticated AWS CLI in your provisioning environment. Your environment (eg, shell/etc from which you'll run terraform commands) must be authenticated as an identity that can assume that role. (see next section for tips on options for various environments you can use)
Eg, if your Role is `arn:aws:iam::123456789012:role/PsoxyProvisioningRole`, the following should work:
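One way to run that check is sketched below, using the standard `aws sts assume-role` command. The wrapper function and the session name `psoxy-prereq-check` are arbitrary choices for this example, not anything Psoxy requires.

```shell
# Try to assume the provisioning role with your current AWS CLI identity.
# Prints the ARN of the assumed-role session on success; fails otherwise.
# $1: ARN of the role you will use to provision Psoxy.
# NOTE: "psoxy-prereq-check" is just a label for the temporary session.
check_can_assume_role() {
  aws sts assume-role \
    --role-arn "$1" \
    --role-session-name "psoxy-prereq-check" \
    --query 'AssumedRoleUser.Arn' \
    --output text
}

# Example (requires an authenticated AWS CLI):
# check_can_assume_role "arn:aws:iam::123456789012:role/PsoxyProvisioningRole"
```

If this fails with an access-denied error, check both the identity your CLI is authenticated as and the trust policy on the role.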
To provision AWS infra, you'll need the `aws-cli` installed and authenticated on the environment where you'll run `terraform`.
Here are a few options:
Generate an AWS Access Key for your AWS User.
Run `aws configure` in a terminal on the machine you plan to use, and configure it with the key you generated in step one.
NOTE: this could even be a GCP Cloud Shell, which may simplify auth if you wish to connect your Psoxy instance to Google Workspace as a data source.
If your organization prefers NOT to authorize the AWS CLI on individual laptops and/or outside AWS, provisioning Psoxy's required infra from an EC2 instance may be an option.
provision an EC2 instance (or request that your IT/dev ops team provision one for you). We recommend a micro instance with an 8GB disk, running `ubuntu` (not Amazon Linux; if you choose that or something else, you may need to adapt these instructions). Be sure to create a PEM key to access it via SSH (unless your AWS Organization/account provides some other ssh solution).
associate the Role above with your instance (see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html)
Whichever environment you choose, follow general prereq installation, and, when ready, continue with README.
You'll also need a backend location for your Terraform state (such as an S3 bucket). It can be in any AWS account, as long as the AWS role that you'll use to run Terraform has read/write access to it.
See https://developer.hashicorp.com/terraform/language/settings/backends/s3 for more details.
Alternatively, you may use a local file system, but this is not recommended for production use - as your Terraform state may contain secrets such as API keys, depending on the sources you connect.
See https://developer.hashicorp.com/terraform/language/settings/backends/local
The module `psoxy-constants` is a dependency-free module that provides lists of AWS managed policies, etc., needed for bootstrapping an AWS account in which your proxy instances will reside.
Once you've fulfilled the prereqs, including having your terraform deployment environment, backend, and AWS account prepared, we suggest you use our AWS example template repo:
https://github.com/Worklytics/psoxy-example-aws
Follow the 'Usage' instructions there to continue.
If you use psoxy to send pseudonymized data to Worklytics and later wish to re-identify the data that you export from Worklytics on your premises, you'll need a lookup table in your data warehouse to JOIN to data exported from Worklytics.
The S3 bucket in which this table is stored, if any, will be shown as a Terraform output, eg:
Use the bucket name shown in your output to build an import pipeline to your data warehouse.
Some organizations require use of API Gateway. This is not the default approach for Psoxy since AWS added support for Lambda Function URLs (March 2022), which are a simpler and more direct way to expose lambdas via HTTPS.
Nonetheless, should you wish to use API Gateway, we provide beta support for this. It is needed if you wish to put your Lambda functions on a VPC (see `lambdas-on-vpc.md`).
In particular:
IAM policy that allows api gateway methods to be invoked by the proxy caller role is defined once, using wildcards, and exposes GET/HEAD/POST methods for all resources. While methods are further constrained by routes and the proxy rules themselves, this could be another enforcement point at the infrastructure level - at expense of N policies + attachments in your terraform plan instead of 1.
proxy instances exposed as lambda function urls have 55s timeout, but API Gateway seems to support 30s as max - so this may cause timeouts in certain APIs
Prerequisites:
Add the following to your `terraform.tfvars` file:
Then `terraform apply` should create the API Gateway-related resources, including policies/etc, and destroy lambda function URLs (if you've previously applied with `use_api_gateway=false`, which is the default).
If you wish to use API Gateway V1, you will not be able to use the flag above. Instead, you'll have to do something like the following:
Additionally, you'll need to set a different handler class to be invoked instead of the default (`co.worklytics.psoxy.Handler`); it should be `co.worklytics.psoxy.APIGatewayV1Handler`. This can be done in Terraform or by modifying configuration via AWS Console.
Some ideas on how to support scenarios and configuration requirements beyond what our default examples show:
If you're using our AWS example, it should support a `default_tags` variable.
You can add the following in your `terraform.tfvars` file to set tags on all resources created by the example configuration:
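A possible shape for that entry is sketched here as a small shell helper that appends it to a tfvars file. It assumes `default_tags` accepts a map of strings; the tag keys and values are placeholders, so check the variable's definition in your example before relying on it.

```shell
# Append a default_tags entry to a tfvars file (defaults to ./terraform.tfvars).
# ASSUMPTION: the example's `default_tags` variable takes a map of strings.
# The tag names/values below are placeholders; replace them with your own.
append_default_tags() {
  cat >> "${1:-terraform.tfvars}" <<'EOF'
default_tags = {
  Environment = "production"
  Owner       = "people-analytics"
}
EOF
}

# Example:
# append_default_tags terraform.tfvars
```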
If you're not using our AWS example, you will need to modify the `aws` provider block in your configuration to add a `default_tags` block. Example shown below:
See: [https://registry.terraform.io/providers/hashicorp/aws/latest/docs#default_tags]
To support extensibility, our Terraform examples/modules output the IDs/names of the major resources they create, so that you can compose them with other Terraform resources.
The `aws-host` module outputs `bulk_connector_instances`, a map of `id => instance` for each bulk connector. Each of these has two attributes that correspond to the names of its related buckets:
`sanitized_bucket_name`
`input_bucket_name`
So in our AWS example, you can use these to enable logging; for example, you could do something like this (YMMV; syntax etc. should be tested):
Analogous approaches can be used to configure versioning, replication, etc;
Note that encryption, lifecycle, and public_access_block are set by the Worklytics-provided modules, so you may run into conflicts if you also try to set those outside the modules.
beta - released from v0.4.50; YMMV, and may be subject to change.
The Terraform modules we provide provision execution roles for each lambda function, and by default attach the appropriate AWS Managed Policy to each.
For organizations that don't allow use of AWS Managed Policies, you can use the `aws_lambda_execution_role_policy_arn` variable to pass in an alternative, which will be used INSTEAD of the AWS Managed Policy.
beta - This is now available for customer-use, but may still change in backwards incompatible ways.
Some caveats:
VPC must allow any API connector to connect to data source APIs via HTTPS (eg 443); usually these APIs are on the public internet, so this means egress to public internet.
VPC must allow your API gateway to connect to your lambdas.
The requirements above MAY require you to modify your VPC configuration, and/or the security groups to support proxy deployment.
vpc.tf
Prerequisites:
the AWS principal (user or role) you're using to run Terraform must have permissions to manage VPCs, subnets, and security groups. The AWS managed policy `AmazonVPCFullAccess` provides this.
NOTE: if you provide `vpc_config`, the value you pass for `use_api_gateway_v2` will be ignored; using a VPC requires API Gateway v2.
Add the following to the "psoxy" module in your `main.tf` (or uncomment if already present):
Uncomment the relevant lines in `vpc.tf` in the same directory, and modify as you wish. This file pulls the default VPC/subnet/security group for your AWS account under Terraform.
Alternatively, you can modify `vpc.tf` to provision a non-default VPC/subnet/security group, and reference those from your `main.tf`, subject to the caveats above.
See the following terraform resources that you'll likely need:
Check your Cloud Watch logs for the lambda. Proxy lambda will time out in INIT phase if SSM Parameter Store or your secret store implementation (AWS Secrets Manager, Vault) is not reachable.
Some potential causes of this:
DNS failure - it's going to look up the SSM service by domain; if the DNS zone for the SSM endpoint you've provisioned is not published on the VPC, this will fail; similarly, if the endpoint wasn't configured on a subnet - then it won't have an IP to be resolved.
if the IP is resolved, you should see failure to connect to it in the logs (timeouts); check that your security groups for lambda/subnet/endpoint allow bidirectional traffic necessary for your lambda to retrieve data from SSM via the REST API.
Terraform with aws provider doesn't seem to play nice with lambdas/subnets; the subnet can't be destroyed w/o destroying the lambda, but terraform seems unaware of this and will just wait forever.
So:
destroy all your lambdas (`terraform state list | grep aws_lambda_function`; then `terraform destroy --target=` for each, quoting the address as needed)
destroy the subnet: `terraform destroy --target=aws_subnet.main`
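The two steps above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, `aws_subnet.main` is the subnet address used in the step above and may differ in your configuration, and each `terraform destroy` will still prompt for confirmation.

```shell
# Keep only state addresses that mention aws_lambda_function (same grep as above;
# note it also matches related resources such as aws_lambda_function_url).
filter_lambda_addresses() {
  grep 'aws_lambda_function'
}

# Destroy each matching resource, then the subnet.
destroy_lambdas_then_subnet() {
  addrs_file=$(mktemp)
  terraform state list | filter_lambda_addresses > "$addrs_file"
  # Read addresses on fd 3 so terraform's confirmation prompt keeps using stdin.
  while IFS= read -r addr <&3; do
    # Quoting "$addr" protects brackets/quotes in addresses like foo["bar"].
    terraform destroy --target="$addr"
  done 3< "$addrs_file"
  rm -f "$addrs_file"
  terraform destroy --target=aws_subnet.main
}

# Example (run from your Terraform configuration directory):
# destroy_lambdas_then_subnet
```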
https://docs.aws.amazon.com/lambda/latest/dg/foundation-networking.html
https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc.html
In our various examples (see ), we provide a variable `lookup_table_builder` that allows you to control generation of these lookup tables. Populating this variable will generate another version of your HRIS data (aside from the one exposed to Worklytics) which you can then import back to your data warehouse.
the AWS principal (user or role) to provision API gateways. The AWS managed policy provides this.
see
Specifically, this is , unless you're using a VPC - in which case it is `AWSLambdaVPCAccessExecutionRole` (https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSLambdaVPCAccessExecutionRole.html).
Our `aws-host` module provides a `vpc_config` variable to specify the VPC configuration for the lambdas that our Terraform modules will create, analogous to the block supported by the AWS lambda terraform resource.
API connectors on a VPC must be exposed via rather than
VPC must be configured such that your lambda has connectivity to AWS services including S3, SSM, and CloudWatch Logs; this is typically done by adding a for each service.
all pre-requisites for the api-gateways (see )
Tips and tricks for using AWS to host the proxy.
If the above doesn't seem to work as expected, some ideas in the next section may help.
Options:
execute terraform via AWS Cloud Shell
find credentials output by your SSO helper (eg, `aws-okta`), then fill the AWS CLI env variables yourself:
if your SSO helper fills the default AWS credentials file but simply doesn't set the env vars, you may be able to export the profile to `AWS_PROFILE`, eg
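For the env-variable options above, the exports would look something like the following. The variable names are the standard ones read by the AWS CLI and SDKs; the values are placeholders, and the profile name in the commented line is hypothetical.

```shell
# Option: paste the short-lived credentials printed by your SSO helper.
# The values below are placeholders, not real credentials.
export AWS_ACCESS_KEY_ID="REPLACE_WITH_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="REPLACE_WITH_SECRET_ACCESS_KEY"
export AWS_SESSION_TOKEN="REPLACE_WITH_SESSION_TOKEN"

# Option: if your helper already wrote a profile to the default credentials file,
# point the CLI at it instead ("my-sso-profile" is a hypothetical profile name):
# export AWS_PROFILE="my-sso-profile"
```

Environment variables take effect only in the current shell session, so run `terraform` from the same terminal.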
References: https://discuss.hashicorp.com/t/using-credential-created-by-aws-sso-for-terraform/23075/7
Options:
execute terraform via AWS Cloud Shell
use a script such as aws-mfa to get short-lived key+secret for your user.
Log into AWS web console
navigate to the AWS account that hosts your proxy instance (you may need to assume a role in that account)
then the region in that account in which your proxy instance is deployed (default `us-east-1`)
then search or navigate to the AWS Lambdas feature, and find the specific one you wish to debug
find the tabs for `Monitoring`, then within that, `Logging`, then click "go to Cloud Watch"
Unless your AWS CLI is auth'd as a user who can review logs, first auth it for such a role.
You can do this with a new profile, or setting env variables as follows:
Then, you can do a series of commands as follows:
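A sketch of such a series using the standard `aws logs` subcommands is below. It assumes your function logs to the default Lambda log group, `/aws/lambda/<function-name>`; the helper names and the example function name `psoxy-gcal` are hypothetical.

```shell
# Default CloudWatch log group name for a Lambda function.
# ASSUMPTION: your deployment uses the default /aws/lambda/<function-name> group.
log_group_for() {
  printf '/aws/lambda/%s\n' "$1"
}

# List the five most recently active log streams for a function.
recent_log_streams() {
  aws logs describe-log-streams \
    --log-group-name "$(log_group_for "$1")" \
    --order-by LastEventTime \
    --descending \
    --max-items 5
}

# Print the events from one log stream of a function.
stream_events() {
  aws logs get-log-events \
    --log-group-name "$(log_group_for "$1")" \
    --log-stream-name "$2"
}

# Example:
# recent_log_streams psoxy-gcal
# stream_events psoxy-gcal '<logStreamName from the previous output>'
```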
Something like the following:
Your Terraform state is inconsistent. Run something like the following, adapted for your connector:
NOTE: you likely need to change `outlook-mail` if your error is with a different data source. The `\` chars are needed to escape the double-quotes/brackets in your bash command.
Something like the following:
Check:
the SSM parameter exists in the AWS account
the SSM parameter can be read by the lambda's execution role (eg, it has an attached IAM policy that allows the SSM parameter to be read; you can test this with the AWS Policy Simulator, setting 'Role' to your lambda's execution role, 'Service' to 'AWS Systems Manager', 'Action' to 'Get Parameter' and 'Resource' to the SSM parameter's ARN)
the SSM parameter can be decrypted by the lambda's execution role (if it's encrypted with a KMS key)
Setting `IS_DEVELOPMENT_MODE` to "true" in the Lambda's Env Vars via the console can enable some additional logging with detailed SSM error messages that will be helpful; but note that some of these errors will be expected in certain configurations.
Our Terraform examples should provide both of the above for you, but worth double-checking.
If those are present, yet the error persists, it's possible that you have some org-level security constraint/policy preventing SSM parameters from being used or read. For example, you may have a "default deny" policy set for SSM GET actions. In such a case, you need to add the execution roles for each lambda as exceptions to such policies (find these under AWS --> IAM --> Roles).
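To confirm from a terminal that a parameter exists and can be decrypted, something like the sketch below may help. It uses the standard `aws ssm get-parameter` command; the function name and the example parameter name are hypothetical. Note that it checks access for whatever identity your CLI is authenticated as, not for the lambda's execution role.

```shell
# Check that an SSM parameter exists and is readable (and decryptable) by the
# identity your AWS CLI is currently using. Prints the parameter name on success.
# $1: full name of the SSM parameter.
check_ssm_parameter() {
  aws ssm get-parameter \
    --name "$1" \
    --with-decryption \
    --query 'Parameter.Name' \
    --output text
}

# Example (parameter name is a placeholder):
# check_ssm_parameter "/path/to/YOUR_PARAMETER"
```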