Bulk Data
1. Bulk Data Import Overview
For certain data types, Worklytics supports importing data from bulk files, either uploaded directly to Worklytics via our web portal or read from a cloud storage location (such as Google Cloud Storage, AWS S3, etc.). This document provides an overview of that process and the supported file formats for each data type.
The data flow is as follows:
1. Export your data to a file (`.csv`).
2. Upload the flat file directly to Worklytics (via our Web App) or to a cloud storage location that you've connected to Worklytics (an AWS S3 bucket, a Google Cloud Storage bucket, etc.).
3. Worklytics parses the file and loads your data into its system (eg, updates our representation of your HRIS data accordingly, appends the survey results, etc.).
If you're using our sanitization/pseudonymization proxy (Psoxy), the second step of the flow includes an intermediary phase: the file is first loaded into a storage location within your premises, where our proxy (running in your premises) applies pseudonymization and places the sanitized result into a second storage location (also in your premises). You may either download the sanitized file from that location and upload it via our Web App, or connect Worklytics directly to the location of the sanitized file (preferred).
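For instance, if the connected storage location is a Google Cloud Storage bucket, the upload step might look like the following minimal Python sketch. The bucket name, object name, and local file path are hypothetical, and the `google-cloud-storage` client library is just one of many ways to perform the upload.

```python
# Minimal sketch: upload an exported CSV to a GCS bucket connected to Worklytics.
# Bucket, object, and file names below are hypothetical examples.
from google.cloud import storage

def upload_bulk_file(bucket_name: str, local_path: str, object_name: str) -> None:
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)

upload_bulk_file("example-import-bucket", "hris_snapshot.csv", "hris_20221209.csv")
```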
2. File Naming Requirements
- File names should use only `[A-Za-z0-9_.-]` characters. Do not use whitespace or special characters, to avoid transfer problems to/from various cloud storage providers (*).
- The suffix of a file after the last `.` is expected to be a file format (eg, `.csv`).
- The last 8 characters prior to the last `.` are expected to be a date in `YYYYMMDD` format (ie, ISO 8601 basic format), if applicable for the file type. This is expected to be the effective date of the file, although the semantics of how this value is interpreted may vary by import type. See the sketch below for one way to check these rules.
(*) File naming considerations: Google Cloud Storage and Amazon Simple Storage Service.
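If you generate file names programmatically, a small check like the following Python sketch can catch naming problems before upload. It assumes a `YYYYMMDD` effective date applies to the file type; the helper name and example file names are hypothetical.

```python
# Minimal sketch of the naming rules above; not an official validator.
import re
from datetime import datetime

FILENAME_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+$")

def check_bulk_file_name(name: str) -> bool:
    if not FILENAME_PATTERN.match(name):
        return False  # whitespace or special characters present
    stem, _, suffix = name.rpartition(".")
    if not stem or not suffix:
        return False  # expect a file-format suffix after the last "."
    # The last 8 characters before the last "." should be a YYYYMMDD effective date.
    try:
        datetime.strptime(stem[-8:], "%Y%m%d")
    except ValueError:
        return False
    return True

assert check_bulk_file_name("hris_20221209.csv")
assert not check_bulk_file_name("hris export 2022-12-09.csv")  # whitespace, wrong date format
```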
3. CSV Formatting Requirements
For data in Comma-Separated Values format (`.csv`, or CSV), you must follow the formatting requirements and conventions described in this section.
We parse CSV as specified in RFC 4180, except as noted here (and in later sections for each particular data type: HRIS, Surveys, etc).
In particular, each file MUST include a header row as the first line of the file, specifying column names (this is NOT required by RFC 4180). This header line is interpreted as follows:
- column names are interpreted as case-insensitive (`EMPLOYEE_ID` is equivalent to `employee_id`)
- leading/trailing whitespace is trimmed (ignored) (` EMPLOYEE_ID ` is equivalent to `EMPLOYEE_ID`)
- column names MAY be enclosed in double quotes (`"EMPLOYEE_ID"` is equivalent to `EMPLOYEE_ID`)
- column names containing commas MUST be enclosed in double quotes
The ordering of columns within a file does not matter and need not be consistent across files.
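As a rough illustration (not Worklytics' actual parser), the following Python sketch normalizes a header line according to the rules above; the column names shown are hypothetical.

```python
# Minimal sketch: normalize a CSV header line per the rules described above.
import csv
import io

def normalized_header(first_line: str) -> list[str]:
    # csv.reader strips the optional surrounding double quotes; we additionally
    # trim whitespace and lower-case, since column names are case-insensitive.
    header = next(csv.reader(io.StringIO(first_line)))
    return [name.strip().lower() for name in header]

print(normalized_header('"EMPLOYEE_ID","Team, Primary",manager_id \n'))
# -> ['employee_id', 'team, primary', 'manager_id']
```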
The following table summarizes the values accepted for each column type.
TYPE | VALUES ACCEPTED |
---|---|
STRING | Any UTF-8 characters. |
BOOLEAN | `TRUE`; any other value is parsed as false. |
FLOAT | Numerical values. |
TIME_OF_DAY | `HH:MM`, in 24-hour format. |
DATE | A calendar date; do not mix formats within a file (see NOTES below). We prefer ISO format, `yyyy-MM-dd`. |
DATETIME | ISO instant format, in UTC (eg, `2022-12-09T10:15:30Z`). |
NOTES:
- Any value MAY be enclosed in matching double-quotes (`"`). If so, and the value itself contains a double-quote, it MUST be escaped by preceding it with another double-quote (`"aaa","b""bb","ccc"`).
- Any value that contains a comma (`,`) MUST be enclosed in double-quotes (`"`). Eg, a row intended to contain the value `Smith, John` in the second column must be formatted as `valueA,"Smith, John",valueC` (see the sketch after these notes).
- Do NOT mix `DATE` formats within a single file type, as this is potentially ambiguous. Use only one or the other and set the format on the connection. We prefer ISO, specifically `yyyy-MM-dd` (eg, `2022-12-09`), as the most readable and unambiguous.
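If you produce files programmatically, a standard CSV library will apply these quoting and escaping rules for you. A minimal Python sketch, with hypothetical column names and values:

```python
# Minimal sketch: the csv module quotes values containing commas and doubles
# embedded double-quotes, per the rules described in the notes above.
import csv
import sys
from datetime import date

writer = csv.writer(sys.stdout)  # default dialect quotes only when necessary
writer.writerow(["EMPLOYEE_ID", "FULL_NAME", "NOTE", "START_DATE"])
writer.writerow(["E-001", "Smith, John", 'prefers "Jack"', date(2022, 12, 9).isoformat()])
# Output:
# EMPLOYEE_ID,FULL_NAME,NOTE,START_DATE
# E-001,"Smith, John","prefers ""Jack""",2022-12-09
```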
Entity Identifier Format Consistency
Any identifier, such as employee identifiers (`EMPLOYEE_ID`, `MANAGER_ID`, etc), MUST be formatted consistently across ALL data sources and import files for which you intend them to match, except for leading/trailing whitespace. Eg, identifiers that refer to the same entity (person) MUST be **byte-wise equivalent** after trimming any leading/trailing whitespace.
Eg, `0123` will NOT match `123`, `abc` will NOT match `ABC`, etc.
Email addresses are a special case, where pseudonyms are generated based on canonicalization of the domain and mailbox portions following typical email address semantics. For example, addresses are handled case-insensitively, `.` characters in the mailbox portion are ignored, etc. Eg, `alicedoe@acme.com` will result in the same pseudonym as `alice.doe@acme.com`, `Alice.Doe@Acme.com`, etc.
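To make the matching rule concrete, here is a minimal Python sketch. It is NOT Worklytics' actual matching logic: the email canonicalization shown covers only the behaviors described above (case-insensitivity, ignoring `.` in the mailbox portion), and the real handling may include further normalization.

```python
# Minimal sketch of the consistency rules described above (illustrative only).
def ids_match(a: str, b: str) -> bool:
    # Identifiers must be byte-wise equal after trimming leading/trailing whitespace.
    return a.strip() == b.strip()

def canonicalize_email(address: str) -> str:
    mailbox, _, domain = address.strip().partition("@")
    return mailbox.replace(".", "").lower() + "@" + domain.lower()

assert not ids_match("0123", "123")
assert not ids_match("abc", "ABC")
assert canonicalize_email("alicedoe@acme.com") == canonicalize_email("Alice.Doe@Acme.com")
```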
For example, consider an HRIS import file (File 1) and a survey data file (File 2). If File 1 identifies an employee with an `EMPLOYEE_ID` of `E-001` while File 2 uses `e1`, Worklytics will NOT match these rows as references to the same individual.
Please verify that all sources you intend to connect to Worklytics provide employee ids in the same format.
4. Compression
Worklytics supports ingestion of files compressed with GZIP. We strongly recommend you use this to improve performance and reduce cost.
There are two supported ways to indicate that your file is compressed:

1. Set the `Content-Encoding` metadata on the storage object to `gzip`. This is the most standards-compliant approach and plays well with native GCS/S3 tooling, so it is our preferred method (see the sketch below).
2. Append the `.gz` suffix to the file name (DEPRECATED; may be removed from January 30, 2025).
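For example, with an AWS S3 bucket, the preferred approach (gzip plus `Content-Encoding` metadata, keeping the original `.csv` object name) might look like the following minimal Python sketch; the bucket, key, and file names are hypothetical.

```python
# Minimal sketch: gzip a bulk file and upload it to S3 with Content-Encoding set.
import gzip
import shutil
import boto3

def upload_gzipped(local_csv: str, bucket: str, key: str) -> None:
    gz_path = local_csv + ".gz.tmp"
    with open(local_csv, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Keep the .csv object key; signal compression via Content-Encoding metadata.
    boto3.client("s3").upload_file(
        gz_path, bucket, key, ExtraArgs={"ContentEncoding": "gzip"}
    )

upload_gzipped("hris_20221209.csv", "example-worklytics-import", "hris_20221209.csv")
```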