Create a Dataset
A dataset can be created by sending a POST request to /v2/datasources. We support three types of datasets:
- Uploading a PDF or Excel file to the One Creation platform.
- Connecting to a supported data location, which allows the metadata to be discovered automatically.
- Aggregating two or more existing datasets, which merges them virtually.
This guide details how to use our API to create each of these dataset types.
Request Body
The request body contains the two parameters shown in the table below. If you are uploading a file as a dataset, populate the "file" parameter; otherwise, you only need the "request" parameter to connect to a data location or aggregate existing datasets.
| Key | Type | Description |
| --- | --- | --- |
| request | application/json | Required. The dataset definition, including any connection information for your data location (database or filesystem). |
| file | multipart/form-data | Required only if you are attaching a file to be protected. |
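For illustration, here is a minimal sketch of this call using Python's requests library. The host URL and bearer-token header are placeholders for your deployment's address and authentication scheme, and the payload is abbreviated.

import json
import requests

# Placeholder host and credentials; substitute your deployment's values.
BASE_URL = "https://your-one-creation-host"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

# The dataset definition is sent as the "request" form part (JSON).
request_payload = {
    "name": "my dataset",
    "source": "POSTGRESQL",
    "fields": [],  # abbreviated; see the "fields" format below
}

response = requests.post(
    f"{BASE_URL}/v2/datasources",
    headers=HEADERS,
    # Sending via files= makes this a multipart/form-data request;
    # a file-backed dataset would add a "file" part here as well.
    files={"request": (None, json.dumps(request_payload), "application/json")},
)
print(response.status_code, response.text)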
Common Fields
Here are the common fields within the "request" parameter:
name
The name of the dataset.
source
The type of the dataset. One of the following enumerated values:
- AWSS3
- DATABRICKS
- DREMIO
- EXCEL: as an attachment
- KX
- MYSQL
- ORACLE
- PDF: as an attachment
- POSTGRESQL
- REDSHIFT
- SHAREPOINT
- SNOWFLAKE
reviewer
Optional. The email address of the reviewer.
fields
The list of metadata/attributes used to set up data-sharing rules, in the following format:
"fields":[
{"level":0,"type":"folder","format":"","path":["table name"]},
{"level":1,"type":"folder","format":"string","path":["table name","column1"]},
{"level":1,"type":"folder","format":"string","path":["table name","column2"]},
{"level":1,"type":"folder","format":"string","path":["table name","column3"]}]
For example, assume an Excel file or database table called "users" looks like this:
| name | gender | birthday |
| --- | --- | --- |
| Yang C | Female | 7/19/2011 |
| Liat Z | Female | 6/21/1993 |
| Kevin B | Male | 2/1/1983 |
The value of "fields" will look like this:
"fields":[
{"level":0,"type":"folder","format":"","path":["users"]},
{"level":1,"type":"folder","format":"string","path":["users","name"]},
{"level":1,"type":"folder","format":"string","path":["users","gender"]},
{"level":1,"type":"folder","format":"string","path":["users","birthday"]}]
Connecting to a Dataset
The following fields establish a connection to your data source, which is then used to create your One Creation dataset. They are optional if you are creating a dataset manually (by file upload) or through aggregation.
datasourceConnection
Contains the source-specific connection fields listed below:
Databricks
- dbType: "databricks"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
MySQL
- dbType: "mysql"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
Oracle
- dbType: "oracle"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbSchema: the name of the database schema
- dbTable: the name of the table/view
PostgreSQL
- dbType: "postgresql"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
Redshift
- dbType: "redshift"
- dbName: the name of the database
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
Snowflake
- dbType: "snowflake"
- dbName: the name of the database
- dbHost: the database host URL
- dbUsername: the login username
- dbPassword: the login password
- dbWarehouse: the name of the warehouse
- dbRole: the login user role
- dbSchema: the name of the database schema
- dbTable: the name of the table/view
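As a sketch, a Snowflake datasourceConnection might look like the following; all values are placeholders.

# Placeholder values throughout; note that Snowflake takes no dbPort but
# adds dbWarehouse, dbRole, and dbSchema.
snowflake_connection = {
    "dbType": "snowflake",
    "dbName": "ANALYTICS",
    "dbHost": "myaccount.snowflakecomputing.com",
    "dbUsername": "username",
    "dbPassword": "password",
    "dbWarehouse": "COMPUTE_WH",
    "dbRole": "SYSADMIN",
    "dbSchema": "PUBLIC",
    "dbTable": "USERS",
}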
Dremio
- dbType: "dremio"
- dbSchema: the name of the database schema
- dbHost: the database host URL
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbTable: the name of the table/view
AWS S3
- awsAccessKeyId: the S3 bucket access key
- awsSecretAccessKey: the secret key
- bucketName: the name of the S3 bucket
- region: the region of the S3 bucket. e.g. eu-central-1
- templateName: the name of the template file with file extension. E.g. "data_template.xlsx"
- filterPattern: optional. The prefix of the data file names; if the bucket contains multiple files with different names, this prefix is used to find the data file for distribution.
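For example, an AWS S3 datasourceConnection might look like this; all values are placeholders.

# Placeholder values; filterPattern is optional and narrows the search to
# files whose names start with the given prefix.
s3_connection = {
    "awsAccessKeyId": "<access-key-id>",
    "awsSecretAccessKey": "<secret-access-key>",
    "bucketName": "my-data-bucket",
    "region": "eu-central-1",
    "templateName": "data_template.xlsx",
    "filterPattern": "daily_export_",
}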
OneDrive
- tenantId: the tenant ID
- secretValue: the client secret key
- applicationId: the ID of the application
- driveItemId: the ID of the item you wish to use for the dataset
- siteId: the ID of the site
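A sketch of a OneDrive datasourceConnection with placeholder values:

# Placeholder values; obtain these IDs from your Microsoft tenant.
onedrive_connection = {
    "tenantId": "<tenant-id>",
    "secretValue": "<client-secret>",
    "applicationId": "<application-id>",
    "driveItemId": "<drive-item-id>",
    "siteId": "<site-id>",
}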
KX
- dbType: "kx"
- dbHost: the IP address of the database host
- dbPort: the database port number
- dbUsername: the login username
- dbPassword: the login password
- dbStoredProcedure: the name of the stored procedure
- dbStoredProcedureParameters: the stored procedure parameter, an object with the following keys (see the sketch after this list):
  - label: the label of the parameter
  - type: the type of the value; either String, Number, Date, Time, Datetime, or boolean
  - length: the length of the value, e.g. "5"
  - value: the value of the parameter, e.g. "trade"
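A sketch of a KX datasourceConnection follows; the host, credentials, stored procedure name, and parameter label are placeholders, while the length and value mirror the example values above.

# Placeholder connection details; the parameter's length and value mirror
# the example values in the field list above.
kx_connection = {
    "dbType": "kx",
    "dbHost": "10.0.0.5",
    "dbPort": "5001",
    "dbUsername": "username",
    "dbPassword": "password",
    "dbStoredProcedure": "getTrades",  # hypothetical procedure name
    "dbStoredProcedureParameters": {
        "label": "table",  # hypothetical label
        "type": "String",
        "length": "5",
        "value": "trade",
    },
}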
Aggregation
If you wish to create a dataset that contains data from multiple connections or multiple files, you can create an aggregated dataset. To do so, you will need to create the individual datasets separately before aggregating them.
For example, if you wish to create a dataset that contains data from S3 and MySQL, you will first create those two datasets individually. Once they are created, you will need to acquire their respective datasourceIds. Please see Get/Delete Dataset(s) to learn how to retrieve datasourceIds.
The fields used for aggregation are listed below; a sketch follows the list.
- dataSourceIds: the ID of each dataset you wish to aggregate. This field is required only when you are using aggregation.
- aggregationDatasourceKeys: the attributes you wish to aggregate on between the datasets, typically the name of a column in a spreadsheet. This field is required only when you are using aggregation.
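As a sketch, the aggregation-specific fields of a "request" might look like this; the IDs are placeholders, and representing both fields as arrays is an assumption based on their plural descriptions.

# Placeholder IDs retrieved from previously created datasets; treating both
# fields as arrays is an assumption based on their descriptions above.
aggregation_fields = {
    "dataSourceIds": ["<s3-datasource-id>", "<mysql-datasource-id>"],
    "aggregationDatasourceKeys": ["customer_id"],
}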
Manually Uploading a File
If you wish to manually upload a PDF or Excel file as your dataset, the request still follows the structure detailed above. You must add a PDF or XLSX file as an attachment to the request. When manually uploading a file, datasourceConnection becomes optional and the source field must be either "EXCEL" or "PDF", matching the enumerated values above. A sketch of such a request follows.
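A minimal sketch with Python's requests library; the host, token, and file name are placeholders, as in the earlier sketch.

import json
import requests

# Placeholder dataset definition for a PDF upload; no datasourceConnection
# is needed when a file is attached.
request_payload = {
    "name": "protected report",
    "source": "PDF",
    "fields": [],  # abbreviated
}

with open("report.pdf", "rb") as pdf:  # placeholder file name
    response = requests.post(
        "https://your-one-creation-host/v2/datasources",  # placeholder host
        headers={"Authorization": "Bearer <your-api-token>"},
        files={
            "request": (None, json.dumps(request_payload), "application/json"),
            "file": ("report.pdf", pdf, "application/pdf"),
        },
    )
print(response.status_code, response.text)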
Testing and Example
If you wish to test these endpoints, refer to the corresponding Swagger page, which shows an example request body and lets you send requests to the endpoint. You can also use external software such as Postman; you can download our Postman examples here.
Here's a sample "request" payload:
{
  "name": "NAME",
  "source": "POSTGRESQL",
  "reviewer": "reviewer email",
  "fields": [
    {"level":0,"type":"folder","format":"","path":["table name"]},
    {"level":1,"type":"folder","format":"string","path":["table name","column1"]},
    {"level":1,"type":"folder","format":"string","path":["table name","column2"]},
    {"level":1,"type":"folder","format":"string","path":["table name","column3"]}
  ],
  "datasourceConnection": {
    "dbType": "postgresql",
    "dbName": "database name",
    "dbHost": "database host url",
    "dbPort": "port number",
    "dbUsername": "username",
    "dbPassword": "password",
    "dbTable": "table name"
  }
}
Next Step
Once a dataset is created, you can proceed to Create a Policy.