Bulk Data

Table of Contents

1. Bulk Data Import Overview
2. File Naming Requirements
3. CSV Formatting Requirements
4. Compression

For certain data types, Worklytics supports import of data from bulk files, either uploaded directly to Worklytics via our web portal or from a cloud storage location (such as Google Cloud Storage, AWS S3, etc). This document provides an overview of that process and the supported file formats for each data type.

The data flow is as follows:

  1. Export your data to a file (.csv)

  2. Upload the flat file directly to Worklytics (via our Web App) or to a cloud storage location that you've connected to Worklytics (an AWS S3 bucket, a Google Cloud Storage bucket, etc.)

  3. Worklytics parses the file and loads your data into its system (eg, updates our representation of your HRIS data accordingly, appends the survey results, etc.)

If you're using our sanitization/pseudonymization proxy (Psoxy), the second step of the flow includes an intermediary phase: the file is first loaded into a storage location within your premises, from which our proxy (also running in your premises) applies pseudonymization and places the sanitized result into a second storage location (again in your premises). You may then either download the sanitized file from that location and upload it via our Web App, or connect Worklytics directly to the location of the sanitized file (preferred).
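For illustration, here is a minimal sketch of step 2 in Python, uploading an export file to an AWS S3 bucket with boto3. The bucket name and file names are placeholders; your actual destination (an S3 bucket, a GCS bucket, the Web App, or your Psoxy instance's input bucket) depends on how your connection is configured.

# Sketch of step 2: upload an exported CSV to a cloud storage location
# connected to Worklytics (or, if using Psoxy, to the proxy's input bucket).
# The bucket name and file names below are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="hris_snapshot_20221010.csv",    # local export file
    Bucket="my-worklytics-import-bucket",     # placeholder bucket
    Key="hris/hris_snapshot_20221010.csv",    # object key within the bucket
)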

2. File Naming Requirements

  • File names should use only [A-Za-z0-9_.-] characters. Do not use whitespace or special characters, to avoid transfer problems to/from various cloud storage providers (*).

  • The suffix of the file name after the last . is expected to indicate the file format (eg, .csv).

  • The last 8 characters prior to the last . are expected to be a date in YYYYMMDD format (ISO 8601 basic format), if applicable for the file type. This is interpreted as the effective date of the file, although the exact semantics of this value may vary by import type.

(*) File naming considerations: Google Cloud Storage and Amazon Simple Storage Service.
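To illustrate these rules, here is a minimal sketch (not an official Worklytics tool) that checks a file name against the allowed character set, the format suffix, and the YYYYMMDD effective-date convention:

# Sketch: validate a file name against the naming rules above.
import re
from datetime import datetime

NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+$")

def check_file_name(name: str, expect_date: bool = True) -> None:
    if not NAME_PATTERN.match(name):
        raise ValueError("file name contains whitespace or special characters")
    if "." not in name:
        raise ValueError("missing file format suffix (eg, .csv)")
    stem, _, _suffix = name.rpartition(".")
    if expect_date:
        # the last 8 characters before the final '.' must parse as YYYYMMDD
        datetime.strptime(stem[-8:], "%Y%m%d")

check_file_name("hris_snapshot_20221010.csv")    # passes
# check_file_name("hris snapshot 2022.csv")      # would raise ValueError (whitespace)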

3. CSV Formatting Requirements

For data in Comma-Separated Values format (.csv or CSV), you must follow the formatting requirements and conventions described in this section.

We parse CSV as specified in RFC 4180, except as noted here (and in the later sections for each particular data type: HRIS, Surveys, etc).

In particular, each file MUST include a header row as the first line of the file, specifying column names (this is NOT required by RFC 4180). This header line is interpreted as follows:

  • column names are interpreted as case-insensitive (EMPLOYEE_ID is equivalent to employee_id)

  • leading/trailing whitespace around column names is trimmed (ignored)

  • column names MAY be enclosed in double quotes ("EMPLOYEE_ID" is equivalent to EMPLOYEE_ID)

  • column names containing commas MUST be enclosed in double quotes

The ordering of columns within a file does not matter and need not be consistent across files.
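As an illustration of these header conventions (a sketch only, not Worklytics' actual parser), column names can be normalized by trimming whitespace and lower-casing before matching:

# Sketch: read a CSV per the header conventions above. Python's csv module
# applies RFC 4180-style quoting; column names are additionally trimmed and
# lower-cased so matching is case-insensitive.
import csv
import io

sample = (
    'SNAPSHOT,"EMPLOYEE_ID",Employee_Email \n'
    '2022-10-10,E-001,karen@example.net\n'
)

reader = csv.reader(io.StringIO(sample))
header = [name.strip().lower() for name in next(reader)]
print(header)  # ['snapshot', 'employee_id', 'employee_email']

for row in reader:
    record = dict(zip(header, (value.strip() for value in row)))
    print(record["employee_id"])  # E-001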

The following notes summarize how values must be formatted.

NOTES:

  • Any value MAY be enclosed with matching double-quotes ("). If so, and the value itself contains a double-quote, it MUST be escaped by preceding it with another double-quote. ("aaa","b""bb","ccc")

  • Any value that contains a comma (,) MUST be enclosed in double-quotes ("). Eg, a row intended to contain the value Smith, John in the second column must be formatted as valueA,"Smith, John",valueC

  • Do NOT mix DATE formats within a single file type, as this is potentially ambiguous. Use only one format, and set that format on the connection. We prefer ISO 8601, specifically yyyy-MM-dd (eg, 2022-12-09), as the most readable and unambiguous.
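On the producing side, most CSV libraries apply this quoting automatically. For example, a minimal sketch using Python's csv module (the column and file names here are illustrative):

# Sketch: csv.writer quotes fields containing commas or double-quotes and
# doubles any embedded quotes, per RFC 4180.
import csv

with open("hris_snapshot_20221209.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["SNAPSHOT", "EMPLOYEE_ID", "NAME", "TEAM"])
    # 'Smith, John' contains a comma, so it is written as "Smith, John";
    # the date is written in the preferred ISO yyyy-MM-dd format.
    writer.writerow(["2022-12-09", "E-001", "Smith, John", "Sales"])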

Entity Identifier Format Consistency

Any identifier, such as employee identifiers (EMPLOYEE_ID, MANAGER_ID, etc), MUST be formatted consistently across ALL data sources and import files for which you intend them to match, except for leading/trailing whitespace. Eg, identifiers that refer to the same entity (person) MUST be byte-wise equivalent after trimming any leading/trailing whitespace.

Eg, 0123 will NOT match 123, abc will NOT match ABC, etc.

Email addresses are a special case: pseudonyms are generated from a canonicalization of the mailbox/domain based on typical email address semantics. For example, addresses are handled case-insensitively and . characters in the mailbox portion are ignored. Eg, alicedoe@acme.com will result in the same pseudonym as alice.doe@acme.com, Alice.Doe@Acme.com, etc.
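A rough sketch of that canonicalization (illustrative only; this is applied by Worklytics/Psoxy, not something you need to implement yourself):

# Illustrative sketch of the email canonicalization described above:
# case-insensitive, with '.' characters in the mailbox portion ignored.
def canonicalize_email(email: str) -> str:
    mailbox, _, domain = email.strip().lower().rpartition("@")
    return mailbox.replace(".", "") + "@" + domain

assert canonicalize_email("alicedoe@acme.com") \
    == canonicalize_email("Alice.Doe@Acme.com") \
    == "alicedoe@acme.com"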

For example, consider the identifiers in these two import files:

File 1: HRIS Import

SNAPSHOT,EMPLOYEE_ID,EMPLOYEE_EMAIL,JOIN_DATE,LEAVE_DATE,OFFICE_TZ,MANAGER_ID,ROLE,TEAM
2022-10-10,E-001,karen@example.net,2021-06-15,,Europe/London,     ,Senior Sales Rep,Sales

File 2: Survey data

SNAPSHOT,EMPLOYEE_ID,"I feel valued at Acme.com"
2022-04-01,e1,7

As the EMPLOYEE_ID is E-001 in one file and e1 in the other, Worklytics will NOT match these rows as references to the same individual.

Please verify that all sources you intend to connect to Worklytics provide employee ids in the same format.
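One way to do this is to compare the identifier sets across files before uploading; a minimal sketch (the file names follow the example above and are placeholders):

# Sketch: check that employee identifiers match byte-wise (after trimming
# whitespace) across two import files before uploading them.
import csv

def employee_ids(path: str, column: str = "EMPLOYEE_ID") -> set[str]:
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # column names are case-insensitive, so normalize them before matching
        reader.fieldnames = [name.strip().upper() for name in reader.fieldnames]
        return {row[column].strip() for row in reader}

unmatched = employee_ids("survey_results_20220401.csv") \
    - employee_ids("hris_snapshot_20221010.csv")
if unmatched:
    print("Survey rows that will NOT match any HRIS employee:", unmatched)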

4. Compression

Worklytics supports ingestion of files compressed with GZIP. We strongly recommend you use this to improve performance and reduce cost.

There are two supported ways to indicate that your file is compressed:

  1. Set the Content-Encoding metadata on the storage object to gzip. This is the most standards-compliant approach and plays well with native GCS/S3 tooling, so it is our preferred method (see the sketch after this list).

  2. Append the .gz suffix to the file name (DEPRECATED; may be removed from January 30, 2025)
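For example, a minimal sketch of method 1 in Python, gzipping a file and uploading it to S3 with boto3 (bucket and file names are placeholders):

# Sketch of method 1: compress the export with gzip and set Content-Encoding
# metadata on the uploaded object. The object keeps its .csv name rather than
# relying on the deprecated .gz suffix.
import gzip
import shutil

import boto3

with open("hris_snapshot_20221010.csv", "rb") as src, \
     gzip.open("hris_snapshot_20221010.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file(
    Filename="hris_snapshot_20221010.csv.gz",
    Bucket="my-worklytics-import-bucket",      # placeholder bucket
    Key="hris/hris_snapshot_20221010.csv",     # keep the .csv object name
    ExtraArgs={"ContentEncoding": "gzip", "ContentType": "text/csv"},
)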
