
Load Data in JSON Format from Amazon S3 Using a Wildcard

SingleStore Pipelines can read data from JSON-formatted files stored in an S3 bucket by using a wildcard (*).
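For example, with hypothetical bucket and folder names, the path 'my-bucket/employee-data/*.json' matches these objects:

my-bucket/employee-data/emp1.json
my-bucket/employee-data/emp2.json

but not my-bucket/employee-data/readme.txt, since only keys ending in .json match the pattern.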

Prerequisites

The following prerequisites need to be in place before creating a pipeline that uses a JSON wildcard:

Two or more JSON files to upload to an S3 bucket.

Access to the S3 bucket.
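The pipeline needs permission to list the bucket and read its objects. Below is a minimal sketch of an IAM policy that grants this, assuming a bucket named my-bucket (adjust the resource ARNs for your bucket):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}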

Create a Table

Create a table using the following syntax. The empty SHARD KEY () clause distributes rows evenly across partitions rather than by a specific column (keyless sharding):

CREATE TABLE employees (
  lastname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  firstname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  age INT NOT NULL,
  DOB TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  partner TEXT CHARACTER SET utf8 COLLATE utf8_general_ci,
  hasChildren BOOLEAN,
  children JSON COLLATE utf8_bin,
  SHARD KEY ()
);

The JSON file contents are below. To test, copy the contents into any editor and save each file with a .json extension so the wildcard in the pipeline matches it. Upload both files to the S3 bucket (a sample upload command follows the file contents below). For this example, the file names are emp1.json and emp2.json.

Contents of emp1.json:

{
  "lastname": "Bateman",
  "firstname": "Patrick",
  "DOB": "October 23, 1982",
  "age": "40",
  "partner": "Sharon Bateman",
  "hasChildren": true,
  "children": ["Isabell", "Jackson"]
}
{
  "lastname": "Denbrough",
  "firstname": "Bill",
  "DOB": "January 04, 1990",
  "age": "33",
  "partner": "Audra Phillips",
  "hasChildren": true,
  "children": ["Stephen", "Maya", "Hunter"]
}
{
  "lastname": "Torance",
  "firstname": "Jack",
  "DOB": "June 20, 1980",
  "age": "43",
  "partner": "Wendy Torrance",
  "hasChildren": true,
  "children": ["Danny", "Lucy"]
}
{
  "lastname": "Strode",
  "firstname": "Laurie",
  "DOB": "February 22, 1987",
  "age": "36",
  "partner": null,
  "hasChildren": true,
  "children": ["Jamie Lloyd", "John Tate", "Karen Nelson"]
}
{
  "lastname": "Whateley",
  "firstname": "Wilbur",
  "DOB": "February 02, 1970",
  "age": "53",
  "partner": null,
  "hasChildren": true,
  "children": ["Wilbur Jr.", "Eleanor"]
}

Contents of emp2.json:

{
  "lastname": "Karras",
  "firstname": "Damien",
  "DOB": "April 12, 1973",
  "age": "50",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Wilkes",
  "firstname": "Annie",
  "DOB": "April 01, 1972",
  "age": "51",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Cady",
  "firstname": "Max",
  "DOB": "September 13, 1989",
  "age": "33",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "White",
  "firstname": "Carrie",
  "DOB": "August 25, 2000",
  "age": "22",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "MacNeil",
  "firstname": "Regan",
  "DOB": "April 07, 1985",
  "age": "38",
  "partner": "Orlando Knotts",
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Eriksson",
  "firstname": "Oskar",
  "DOB": "May 29, 2001",
  "age": "22",
  "partner": null,
  "hasChildren": false,
  "children": null
}
{
  "lastname": "Grevers",
  "firstname": "Nick",
  "DOB": "November 21, 1995",
  "age": "28",
  "partner": "Sam Avery",
  "hasChildren": false,
  "children": null
}
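With both files saved locally, upload them to the bucket. A minimal sketch using the AWS CLI, assuming a bucket named my-bucket and a folder named employee-data (any upload method works):

aws s3 cp emp1.json s3://my-bucket/employee-data/emp1.json
aws s3 cp emp2.json s3://my-bucket/employee-data/emp2.json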

Create a Pipeline to the S3 Bucket

Make sure to add the correct pipeline name, bucket name, folder name (if applicable), and AWS credentials.

CREATE PIPELINE emp_data
AS LOAD DATA S3 '<bucket_name>/<folder_name>/*.json'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<xxxxxxxxxxxxxxx>",
"aws_secret_access_key": "<xxxxxxxxxxxxxxx>"}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE employees
FORMAT JSON
(`lastname` <- `lastname` default '',
`firstname` <- `firstname` default '',
`age` <- `age` default -1,
`DOB` <- `DOB` default '',
`partner` <- `partner` default '',
`hasChildren` <- `hasChildren` default '',
`children` <- `children` default '');
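Before starting the pipeline, you can optionally run it in the foreground with TEST PIPELINE, which returns a sample of parsed rows to the client without writing anything to the table:

TEST PIPELINE emp_data LIMIT 1;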

Start the pipeline.

START PIPELINE emp_data;
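The pipeline now runs in the background. To confirm it is in the Running state:

SHOW PIPELINES;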

Verify the data has loaded. The contents of both JSON files should be loaded into the table.

SELECT * FROM employees;
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| lastname  | firstname | age | DOB                | partner        | hasChildren | children                                   |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| Bateman   | Patrick   |  40 | October 23, 1982   | Sharon Bateman |           1 | ["Isabell","Jackson"]                      |
| Eriksson  | Oskar     |  22 | May 29, 2001       | NULL           |           0 | NULL                                       |
| MacNeil   | Regan     |  38 | April 07, 1985     | Orlando Knotts |           0 | NULL                                       |
| Grevers   | Nick      |  28 | November 21, 1995  | Sam Avery      |           0 | NULL                                       |
| Karras    | Damien    |  50 | April 12, 1973     | NULL           |           0 | NULL                                       |
| Whateley  | Wilbur    |  53 | February 02, 1970  | NULL           |           1 | ["Wilbur Jr.","Eleanor"]                   |
| Denbrough | Bill      |  33 | January 04, 1990   | Audra Phillips |           1 | ["Stephen","Maya","Hunter"]                |
| Torance   | Jack      |  43 | June 20, 1980      | Wendy Torrance |           1 | ["Danny","Lucy"]                           |
| Wilkes    | Annie     |  51 | April 01, 1972     | NULL           |           0 | NULL                                       |
| Cady      | Max       |  33 | September 13, 1989 | NULL           |           0 | NULL                                       |
| White     | Carrie    |  22 | August 25, 2000    | NULL           |           0 | NULL                                       |
| Strode    | Laurie    |  36 | February 22, 1987  | NULL           |           1 | ["Jamie Lloyd","John Tate","Karen Nelson"] |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
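Because the children column is typed as JSON, it can be queried with SingleStore's JSON built-ins. As an illustration, the following uses JSON_ARRAY_CONTAINS_STRING to find rows whose children array contains the name Maya:

SELECT firstname, lastname FROM employees
WHERE JSON_ARRAY_CONTAINS_STRING(children, 'Maya');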

Other Useful Pipeline Commands

To troubleshoot an empty table, run the command below. Be sure to add the correct pipeline name.

SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = '<pipeline_name>';
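It is also useful to check which files the pipeline has discovered and whether each one has loaded; the information_schema.pipelines_files view tracks the per-file state:

SELECT * FROM information_schema.pipelines_files
WHERE pipeline_name = '<pipeline_name>';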

Stopping a pipeline is useful when you want to stop additional data from loading but would like to use the pipeline again later. Make sure to include the correct pipeline name.

STOP PIPELINE <pipeline_name>;

Dropping a pipeline is useful when the pipeline was created for a one-off data load and will not be used in the future. Once a pipeline is dropped, it must be recreated in order to be used again.

DROP PIPELINE <pipeline_name>;

A pipeline does not need to be stopped in order to be dropped.