Load Data in JSON Format from Amazon S3 Using a Wildcard
SingleStore Pipelines can read data from JSON-formatted files stored in an S3 bucket by using a wildcard (*).
Prerequisites
The following prerequisites need to be in place before creating a pipeline using a JSON wildcard:
Two or more JSON files for upload to an S3 bucket.
Access to the S3 bucket.
Create a Table
Create a table using the following syntax:
CREATE TABLE employees (
  lastname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  firstname TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  age INT NOT NULL,
  DOB TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  partner TEXT CHARACTER SET utf8 COLLATE utf8_general_ci,
  hasChildren BOOLEAN,
  children JSON COLLATE utf8_bin,
  SHARD KEY ()
);
The JSON file contents are shown below.
Contents of emp1:
{"lastname": "Bateman","firstname": "Patrick","DOB" : "October 23, 1982","age": "40","partner": "Sharon Bateman","hasChildren": true,"children":["Isabell", "Jackson"]}
{"lastname": "Denbrough","firstname": "Bill","DOB" : "January 04, 1990","age": "33","partner": "Audra Phillips","hasChildren": true,"children":["Stephen", "Maya", "Hunter"]}
{"lastname": "Torance","firstname": "Jack","DOB" : "June 20, 1980","age": "43","partner": "Wendy Torrance","hasChildren": true,"children":["Danny", "Lucy"]}
{"lastname": "Strode","firstname": "Laurie","DOB" : "February 22, 1987","age": "36","partner": null,"hasChildren": true,"children":["Jamie Lloyd", "John Tate", "Karen Nelson"]}
{"lastname": "Whateley","firstname": "Wilbur","DOB" : "February 02, 1970","age": "53","partner": null,"hasChildren": true,"children":["Wilbur Jr.", "Eleanor"]}
Contents of emp2:
{"lastname": "Karras","firstname": "Damien","DOB" : "April 12, 1973","age": "50","partner": null,"hasChildren": false,"children": null}
{"lastname": "Wilkes","firstname": "Annie","DOB" : "April 01, 1972","age": "51","partner":null,"hasChildren": false,"children": null}
{"lastname": "Cady","firstname": "Max","DOB" : "September 13, 1989","age": "33","partner": null,"hasChildren": false,"children": null}
{"lastname": "White","firstname": "Carrie","DOB" : "August 25, 2000","age": "22","partner": null,"hasChildren": false,"children": null}
{"lastname": "MacNeil","firstname": "Regan","DOB" : "April 07, 1985","age": "38","partner": "Orlando Knotts","hasChildren": false,"children": null}
{"lastname": "Eriksson","firstname": "Oskar","DOB" : "May 29, 2001","age": "22","partner": null,"hasChildren": false,"children": null}
{"lastname": "Grevers","firstname": "Nick","DOB" : "November 21, 1995","age": "28","partner" : "Sam Avery","hasChildren": false,"children": null}
Create a Pipeline to the S3 Bucket
Make sure to add the correct pipeline name, bucket name, folder name (if applicable), and AWS credentials.
CREATE PIPELINE emp_data AS
LOAD DATA S3 '<bucket_name>/<folder_name>/*.json'
CONFIG '{"region":"us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "<xxxxxxxxxxxxxxx>","aws_secret_access_key": "<xxxxxxxxxxxxxxx>"}'
BATCH_INTERVAL 2500
MAX_PARTITIONS_PER_BATCH 1
DISABLE OUT_OF_ORDER OPTIMIZATION
DISABLE OFFSETS METADATA GC
SKIP DUPLICATE KEY ERRORS
INTO TABLE employees
FORMAT JSON
(
  `lastname` <- `lastname` default '',
  `firstname` <- `firstname` default '',
  `age` <- `age` default -1,
  `DOB` <- `DOB` default -1,
  `partner` <- `partner` default '',
  `hasChildren` <- `hasChildren` default '',
  `children` <- `children` default ''
);
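Optionally, preview what the pipeline will extract before starting it. As a minimal sketch (assuming the pipeline was created with the name emp_data as above), TEST PIPELINE reads a sample from the source files without writing any rows to the table:
TEST PIPELINE emp_data LIMIT 1;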
Start the pipeline.
START PIPELINE emp_data;
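Pipelines load data in batches in the background, so rows may take a moment to appear. As a rough check (the exact column set can vary by SingleStore version), recent batch activity can be inspected through the information_schema.pipelines_batches_summary table:
SELECT pipeline_name, batch_state, batch_time
FROM information_schema.pipelines_batches_summary
WHERE pipeline_name = 'emp_data';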
Verify the data has loaded.
SELECT * FROM employees;
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| lastname | firstname | age | DOB | partner | hasChildren | children |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
| Bateman | Patrick | 40 | October 23, 1982 | Sharon Bateman | 1 | ["Isabell","Jackson"] |
| Eriksson | Oskar | 22 | May 29, 2001 | NULL | 0 | NULL |
| MacNeil | Regan | 38 | April 07, 1985 | Orlando Knotts | 0 | NULL |
| Grevers | Nick | 28 | November 21, 1995 | Sam Avery | 0 | NULL |
| Karras | Damien | 50 | April 12, 1973 | NULL | 0 | NULL |
| Whateley | Wilbur | 53 | February 02, 1970 | NULL | 1 | ["Wilbur Jr.","Eleanor"] |
| Denbrough | Bill | 33 | January 04, 1990 | Audra Phillips | 1 | ["Stephen","Maya","Hunter"] |
| Torance | Jack | 43 | June 20, 1980 | Wendy Torrance | 1 | ["Danny","Lucy"] |
| Wilkes | Annie | 51 | April 01, 1972 | NULL | 0 | NULL |
| Cady | Max | 33 | September 13, 1989 | NULL | 0 | NULL |
| White | Carrie | 22 | August 25, 2000 | NULL | 0 | NULL |
| Strode | Laurie | 36 | February 22, 1987 | NULL | 1 | ["Jamie Lloyd","John Tate","Karen Nelson"] |
+-----------+-----------+-----+--------------------+----------------+-------------+--------------------------------------------+
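With the data loaded, the children JSON column can be queried directly. As an illustrative sketch using SingleStore's JSON_ARRAY_CONTAINS_STRING function (the child name 'Danny' is taken from the sample data above), the following finds every employee whose children array contains a given name:
SELECT lastname, firstname, children
FROM employees
WHERE JSON_ARRAY_CONTAINS_STRING(children, 'Danny');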
Other Useful Pipeline Commands
To troubleshoot an empty table, run the command below, making sure to add the correct pipeline name.
SELECT * FROM information_schema.pipelines_errors
WHERE pipeline_name = '<pipeline_name>';
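It can also help to confirm which files the pipeline has discovered and loaded. As a sketch, the information_schema.pipelines_files table tracks the load state of each file matched by the wildcard (again substituting the correct pipeline name):
SELECT * FROM information_schema.pipelines_files
WHERE pipeline_name = '<pipeline_name>';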
Stopping a pipeline is useful when you want to stop additional data from loading but would like to use the pipeline again.
STOP PIPELINE <pipeline_name>;
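A stopped pipeline retains its metadata and can be resumed later. As a quick sketch, SHOW PIPELINES lists each pipeline in the current database along with its state, after which the pipeline can be restarted:
SHOW PIPELINES;
START PIPELINE <pipeline_name>;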
Dropping a pipeline is useful when the pipeline was created for a one-off data load and it will not be used in the future.
DROP PIPELINE <pipeline_name>;
A pipeline does not need to be stopped in order to be dropped.