Create a Dataset
A dataset can be created by sending a POST request to /v2/datasources. We support three types of datasets:
- Uploading a PDF or Excel file to the One Creation platform.
- Connecting to a supported data location, which allows the metadata to be discovered automatically.
- Aggregating two or more existing datasets, which merges them virtually.
This guide details how to use our API to create each of these dataset types.
Request Body
The request body contains the two parameters shown in the table below. If you are uploading a file as a dataset, populate the "file" parameter; otherwise, you only need the "request" parameter to connect to a data location or aggregate existing datasets.
| Key | Type | Description |
| --- | --- | --- |
| request | application/json | Required. The dataset definition, including any connection information for your data location (database or filesystem). |
| file | multipart/form-data | Required only if you are attaching a file to be protected. |
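For illustration, here is a minimal sketch of this call using Python's requests library. The host URL and bearer-token header are placeholders for your deployment's address and authentication scheme, and the payload is abbreviated.

import json
import requests

# Placeholder host and credentials; substitute your deployment's values.
BASE_URL = "https://your-one-creation-host"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

# The dataset definition is sent as the "request" form part (JSON).
request_payload = {
    "name": "my dataset",
    "source": "POSTGRESQL",
    "fields": [],  # abbreviated; see the "fields" format below
}

response = requests.post(
    f"{BASE_URL}/v2/datasources",
    headers=HEADERS,
    # Sending via files= makes this a multipart/form-data request;
    # a file-backed dataset would add a "file" part here as well.
    files={"request": (None, json.dumps(request_payload), "application/json")},
)
print(response.status_code, response.text)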
Common Fields
Here are the common fields within the "request" parameter:
name
The name of the dataset.
source
The type of the dataset. One of the following enumerated values:
- AWSS3
- DATABRICKS
- DREMIO
- EXCEL: as an attachment
- KX
- MYSQL
- ORACLE
- PDF: as an attachment
- POSTGRESQL
- REDSHIFT
- SHAREPOINT
- SNOWFLAKE
reviewer
Optional. The email address of the reviewer.
fields
The list of metadata/attributes used to set up data-sharing rules, in the following format:
"fields":[
{"level":0,"type":"folder","format":"","path":["table name"]},
{"level":1,"type":"folder","format":"string","path":["table name","column1"]},
{"level":1,"type":"folder","format":"string","path":["table name","column2"]},
{"level":1,"type":"folder","format":"string","path":["table name","column3"]}]
For example, assume an Excel file or database table called "users" looks like this:
| name | gender | birthday |
| --- | --- | --- |
| Yang C | Female | 7/19/2011 |
| Liat Z | Female | 6/21/1993 |
| Kevin B | Male | 2/1/1983 |
The value of "fields" will look like this:
"fields":[
{"level":0,"type":"folder","format":"","path":["users"]},
{"level":1,"type":"folder","format":"string","path":["users","name"]},
{"level":1,"type":"folder","format":"string","path":["users","gender"]},
{"level":1,"type":"folder","format":"string","path":["users","birthday"]}]
Connecting to a Dataset
The following fields establish a connection to your data source, which is then used to create your One Creation dataset. They are optional if you are creating a dataset manually (by file upload) or through aggregation.
datasourceConnection
Contains the source-specific connection fields listed below:
Databricks
- dbType: "databricks"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
MySQL
- dbType: "mysql"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
Oracle
- dbType: "oracle"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbSchema: the name of the database schema
- dbTable: the name of the table/view
PostgreSQL
- dbType: "postgresql"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
Redshift
- dbType: "redshift"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
Snowflake
- dbType: "snowflake"
- dbName: the name of the database
- dbHost: the database host URL
- dbUsername: the login username
- dbPassword: the login password
- dbWarehouse: the name of the warehouse
- dbRole: the login user role
- dbSchema: the name of the database schema
- dbTable: the name of the table/view
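As a sketch, a Snowflake datasourceConnection might look like the following; all values are placeholders.

# Placeholder values throughout; note that Snowflake takes no dbPort but
# adds dbWarehouse, dbRole, and dbSchema.
snowflake_connection = {
    "dbType": "snowflake",
    "dbName": "ANALYTICS",
    "dbHost": "myaccount.snowflakecomputing.com",
    "dbUsername": "username",
    "dbPassword": "password",
    "dbWarehouse": "COMPUTE_WH",
    "dbRole": "SYSADMIN",
    "dbSchema": "PUBLIC",
    "dbTable": "USERS",
}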
Dremio
- dbType: "dremio"
- dbSchema: the name of the database schema
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
AWS S3
- awsAccessKeyId: the S3 bucket access key
- awsSecretAccessKey: the secret key
- bucketName: the name of the S3 bucket
- region: the region of the S3 bucket. e.g. eu-central-1
- templateName: the name of the template file with file extension. E.g. "data_template.xlsx"
- filterPattern: optional. The prefix of the data file names; if the bucket contains multiple files with different names, this prefix is used to find the data file for distribution.
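For example, an AWS S3 datasourceConnection might look like this; all values are placeholders.

# Placeholder values; filterPattern is optional and narrows the search to
# files whose names start with the given prefix.
s3_connection = {
    "awsAccessKeyId": "<access-key-id>",
    "awsSecretAccessKey": "<secret-access-key>",
    "bucketName": "my-data-bucket",
    "region": "eu-central-1",
    "templateName": "data_template.xlsx",
    "filterPattern": "daily_export_",
}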
OneDrive
- tenantId: the tenant ID
- secretValue: the client secret key
- applicationId: the ID of the application
- driveItemId: the ID of the item you wish to use for the dataset
- siteId: the ID of the site
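A sketch of a OneDrive datasourceConnection with placeholder values:

# Placeholder values; obtain these IDs from your Microsoft tenant.
onedrive_connection = {
    "tenantId": "<tenant-id>",
    "secretValue": "<client-secret>",
    "applicationId": "<application-id>",
    "driveItemId": "<drive-item-id>",
    "siteId": "<site-id>",
}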
KX
- dbType: "kx"
- dbHost: the IP address of the database host
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbStoredProcedure: the name of the stored procedure
- dbStoredProcedureParameters: the stored procedure parameter, an object with the following keys (see the sketch after this list):
  - label: the label of the parameter
  - type: the type of the value; either String, Number, Date, Time, Datetime, or boolean
  - length: the length of the value, e.g. "5"
  - value: the value of the parameter, e.g. "trade"
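A sketch of a KX datasourceConnection follows; the host, credentials, stored procedure name, and parameter label are placeholders, while the length and value mirror the example values above.

# Placeholder connection details; the parameter's length and value mirror
# the example values in the field list above.
kx_connection = {
    "dbType": "kx",
    "dbHost": "10.0.0.5",
    "dbPort": "5001",
    "dbUsername": "username",
    "dbPassword": "password",
    "dbStoredProcedure": "getTrades",  # hypothetical procedure name
    "dbStoredProcedureParameters": {
        "label": "table",  # hypothetical label
        "type": "String",
        "length": "5",
        "value": "trade",
    },
}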
Aggregation
If you wish to create a dataset that contains data from multiple connections or multiple files, you can create an aggregated dataset. To do so, you will need to create the individual datasets separately before aggregating them.
For example, if you wish to create a dataset that contains data from S3 and MySQL, you will first create those two datasets individually. Once they are created, you will need to acquire their respective datasourceIds. Please see Get/Delete Dataset(s) to learn how to retrieve datasourceIds.
The fields used for aggregation are listed below; a sketch follows the list.
- dataSourceIds: the ID of each dataset you wish to aggregate. This field is required only when you are using aggregation.
- aggregationDatasourceKeys: the attributes you wish to aggregate on between the datasets, typically the name of a column in a spreadsheet. This field is required only when you are using aggregation.
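As a sketch, the aggregation-specific fields of a "request" might look like this; the IDs are placeholders, and representing both fields as arrays is an assumption based on their plural descriptions.

# Placeholder IDs retrieved from previously created datasets; treating both
# fields as arrays is an assumption based on their descriptions above.
aggregation_fields = {
    "dataSourceIds": ["<s3-datasource-id>", "<mysql-datasource-id>"],
    "aggregationDatasourceKeys": ["customer_id"],
}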
Manually Uploading a File
If you wish to manually upload a PDF or Excel file as your dataset, the request still follows the structure detailed above. You must add a PDF or XLSX file as an attachment to the request. When manually uploading a file, datasourceConnection becomes optional and the source field must be either "EXCEL" or "PDF", matching the enumerated values above. A sketch of such a request follows.
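A minimal sketch with Python's requests library; the host, token, and file name are placeholders, as in the earlier sketch.

import json
import requests

# Placeholder dataset definition for a PDF upload; no datasourceConnection
# is needed when a file is attached.
request_payload = {
    "name": "protected report",
    "source": "PDF",
    "fields": [],  # abbreviated
}

with open("report.pdf", "rb") as pdf:  # placeholder file name
    response = requests.post(
        "https://your-one-creation-host/v2/datasources",  # placeholder host
        headers={"Authorization": "Bearer <your-api-token>"},
        files={
            "request": (None, json.dumps(request_payload), "application/json"),
            "file": ("report.pdf", pdf, "application/pdf"),
        },
    )
print(response.status_code, response.text)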
Testing and Example
If you wish to test these endpoints, refer to the corresponding Swagger page, which shows an example request body and lets you send requests to the endpoint. You can also use external software such as Postman; you can download our Postman examples here.
Here's a sample "request" payload:
{
  "name": "NAME",
  "source": "POSTGRESQL",
  "reviewer": "reviewer email",
  "fields": [
    {"level":0,"type":"folder","format":"","path":["table name"]},
    {"level":1,"type":"folder","format":"string","path":["table name","column1"]},
    {"level":1,"type":"folder","format":"string","path":["table name","column2"]},
    {"level":1,"type":"folder","format":"string","path":["table name","column3"]}
  ],
  "datasourceConnection": {
    "dbType": "postgresql",
    "dbName": "database name",
    "dbHost": "database host url",
    "dbPort": "port number",
    "dbUsername": "username",
    "dbPassword": "password",
    "dbTable": "table name"
  }
}
Next Step
Once a dataset is created, you can proceed to Create a Policy.