From the course: AWS Certified Data Engineer Associate (DEA-C01) Cert Prep

Loading data into Amazon Redshift

- [Instructor] We learned that you can insert data into a SQL table using the INSERT command. However, with a data warehouse like Redshift, the amount of data to load can be quite large. In this lesson, we'll learn how to use the COPY command to efficiently load data into Redshift. The COPY command is the recommended and simplest way to load data from data sources into Redshift. It is optimized to load data in parallel from a DynamoDB table, an S3 bucket, or even remote hosts over SSH. This shows the syntax of the COPY command. The table name is the name of the Redshift table that you are loading. It has to already exist, so you have to create it first if it is a new table. The data source is a pointer to the location of the data at the source. You have to supply credentials and have the permissions to read the data at the source. When you load your table directly from an Amazon DynamoDB table, you have the option to control the amount of Amazon DynamoDB provisioned throughput that you consume, so you can leave some for the other reads happening on your table. You control it by adding the READRATIO parameter to the end of the COPY statement. For example, READRATIO 50 means the copy operation will use up to 50% of the table's provisioned read capacity units. Redshift nodes are partitioned into slices, and each slice receives part of the memory and disk space from the node. The number of slices per node ranges from two for the smallest node size up to 16 for the largest. When you use the COPY command, Amazon Redshift allocates the workload to the cluster nodes and performs the load operations in parallel. The number of files that can be loaded at the same time is equal to the number of slices in the cluster. So a cluster with eight ra3.xlplus nodes will contain 16 slices, and that is the number of parallel loads that can occur. A 128-node cluster of ra3.16xlarge nodes will be able to load 2,048 files simultaneously. This is important to keep in mind when you are loading a large amount of data contained in text files in S3 or on remote servers. The COPY command loads data in parallel across all the slices in your compute nodes. For best performance when loading data, you should split up your files so that there are at least as many files as slices. If you have more files than slices, make sure that the number of files is a multiple of the number of slices that you have. Amazon Redshift does not take file size into account when loading your data, but as a best practice, split your input data into files of roughly equal size, somewhere between one megabyte and one gigabyte after compression, just to ensure that your files are similar in size and the data is loaded evenly among the slices. This diagram represents an 8xlarge compute node, which has 16 node slices. If your source data is contained in only one file, then only one of the slices is going to be used to load the data. By splitting up the larger file into multiple smaller files, you get better performance when loading data into the cluster, because all of the slices are going to be used. When you issue a COPY command, you are loading the data into one table. To use the parallel loading capability, the data should be split into smaller files, again, ideally a multiple of the number of slices. If all the data for the table is in an uncompressed delimited file, like a CSV file or a pipe-delimited file, then Redshift will automatically split the file to use all the slices for you.
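Before moving on to compressed files, here is a minimal sketch of what those COPY statements can look like for the two sources discussed above. The table name, bucket, DynamoDB table, and IAM role ARN are placeholders for illustration, not values from the course.

    -- Load the orders table from files under an S3 prefix; the target table
    -- must already exist, and the IAM role must be able to read the bucket.
    COPY orders
    FROM 's3://my-example-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    DELIMITER '|';

    -- Load the same table from a DynamoDB table, capping the operation
    -- at 50% of the table's provisioned read capacity with READRATIO.
    COPY orders
    FROM 'dynamodb://orders-source-table'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    READRATIO 50;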
For compressed files, you should split them yourself, again into roughly equal sizes between one megabyte and one gigabyte. You should also use a common prefix for each file, so you can just specify the prefix in the COPY command. This example shows one way to do this with data for an orders table, where the files are named orders.csv.1, orders.csv.2, et cetera. Now, if you have a lot of files in your bucket and you want to make sure only the correct ones get loaded for your table, then you can add the location of the files to a manifest file and reference the manifest file in the COPY command. The manifest file is a JSON-formatted text file that lists each file to load. Each file that you specify has to be a full path, including the Amazon S3 bucket and the file prefix. The mandatory option specifies whether the load should automatically fail and error out if the file is missing. A manifest file is required when loading files from a remote host. You need to create the manifest file and store it in an S3 bucket. The entries in this manifest file are somewhat different, because you have to specify the details of the SSH connection: the host name, the user name, the public key, and the command issued to retrieve the data. The command could be as simple as a cat command to read the file. This manifest example is for loading files from an EC2 instance. The default format for the COPY command is text files delimited with a pipe character, but Avro, CSV, and JSON formats are also supported, as well as the columnar formats ORC and Parquet. Redshift can also load files that are compressed with BZIP2, GZIP, LZOP, or Zstandard compression formats, and this often helps speed up the file transfer, since the files are smaller. If you have any issues with loading data, there are three system tables that can help you troubleshoot. The STL_LOAD_ERRORS system table contains the error messages from specific loads, STL_FILE_SCAN lets you see the load times for specific files, and STL_S3CLIENT_ERROR captures errors when Redshift is trying to access the data from S3. While you can specify primary keys and foreign keys when defining your tables, these are only used as hints for Redshift. Redshift does not enforce unique key constraints, so if you try to load the same data twice, you're actually going to end up with duplicate records in your Redshift table.
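To tie these pieces together, here is a rough sketch of a manifest file for the orders example, the matching COPY statement, and a troubleshooting query. The bucket, file names, and role ARN are again made up for illustration.

    A manifest stored, for example, as s3://my-example-bucket/orders/orders.manifest:

    {
      "entries": [
        { "url": "s3://my-example-bucket/orders/orders.csv.1", "mandatory": true },
        { "url": "s3://my-example-bucket/orders/orders.csv.2", "mandatory": true }
      ]
    }

    -- Load only the files listed in the manifest; the MANIFEST keyword tells
    -- COPY that the source location points to a manifest, not to data files.
    COPY orders
    FROM 's3://my-example-bucket/orders/orders.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    MANIFEST
    CSV;

    -- If a load fails, check the most recent entries in STL_LOAD_ERRORS.
    SELECT starttime, filename, line_number, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;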
