Azure Synapse Analytics is a cloud-based analytics service that is used for data warehousing, big data processing, and machine learning. The platform allows for efficient data processing and analysis by providing a variety of distribution types to accommodate various workloads.
The distribution type refers to the way data is stored and distributed across the nodes in a cluster. The distribution type can affect performance, scalability, and cost. Azure Synapse Analytics supports three distribution types: Round Robin, Replicate and Hash.
Round Robin
Round-robin distribution is a distribution type in which data is distributed evenly across all nodes in the system. In Azure Synapse Analytics, this is typically achieved by assigning each row of data to a different node in a circular fashion.
Consider a table with the following data:
ID | Name | Age |
---|---|---|
1 | Harry | 20 |
2 | Hermione | 21 |
3 | Ron | 22 |
4 | Albus | 70 |
5 | Severus | 65 |
If we use round-robin distribution to distribute this data across three nodes, the data would be distributed as follows:
As you can see, the data is distributed evenly across all three nodes.
Round-robin distribution is ideal for tables that do not have a column with a high degree of uniqueness. This is because when using round-robin distribution, data is not grouped or sorted in any particular way. If the data in your table has no natural grouping or sorting order, round-robin distribution can be a good choice.
To create a table with round-robin distribution in Azure Synapse Analytics, you can specify the ROUND_ROBIN distribution option when creating or altering a table. For example:
CREATE TABLE stg.dimEmployee ( ID INT, Name VARCHAR(50), Age INT ) WITH ( DISTRIBUTION = ROUND_ROBIN );
By using the ROUND_ROBIN distribution option, data in the stg.dimEmployee table will be distributed evenly across all nodes in the system in a round-robin fashion.
Replicate Distribution
Replicate distribution is a distribution type in which data is replicated across all nodes in the system. This means that each node contains a complete copy of the data.
Replicate distribution is useful in situations where you need to frequently join or aggregate data across multiple tables. By replicating the data across all nodes, you can reduce the amount of data movement required for these operations.
However, note that replicate distribution is not ideal for tables with a large amount of data or tables with columns that are frequently updated. This is because replicating the data across all nodes can consume a large amount of storage space and can result in slower update performance.
To use replicate distribution in Azure Synapse Analytics, you can specify the REPLICATE distribution option when creating or altering a table. For example:
CREATE TABLE dimEmployee ( ID INT, Name VARCHAR(50), Age INT ) WITH ( DISTRIBUTION = REPLICATE );
By using the REPLICATE distribution option, data in the dimEmployee table will be replicated across all compute nodes in the system.
This can be useful for frequently joining or aggregating data across multiple tables, but may not be ideal for tables with a large amount of data or frequently updated columns.
Suppose we have a table called dimEmployee
with the following data:
ID | Name | Age |
---|---|---|
1 | Harry | 20 |
2 | Hermione | 21 |
3 | Ron | 22 |
4 | Albus | 70 |
5 | Severus | 65 |
If we use replicate distribution to distribute, each compute node would contain a complete copy of the data:
As you can see, each compute node contains a complete copy of the data. This can be useful for frequently joining or aggregating data across multiple tables, as it reduces the amount of data movement required for these operations. However, note that replicate distribution is not ideal for tables with a large amount of data or tables with columns that are frequently updated, as it can consume a large amount of storage space and result in slower update performance.
Hash Distribution
Hash distribution is a distribution type in which data is distributed based on the value of a specified column. In Azure Synapse Analytics, the system calculates a hash value for the specified column, and uses this value to determine which node the data should be sent to.
Hash distribution is ideal for tables with a column with a high degree of uniqueness. When using hash distribution, data is grouped together based on the value of the specified column. If the column has a high degree of uniqueness, the data will be evenly distributed across all nodes in the system.
To create a table with hash distribution in Azure Synapse Analytics, you can specify the HASH distribution option when creating or altering a table, along with the name of the column to use for distribution.
For example, suppose you have a table called FactOrder
with the following columns:
BrandID | ProductID | OrderDate | Total |
---|---|---|---|
1 | P1 | 2022-01-01 | 100.00 |
1 | P2 | 2022-01-02 | 200.00 |
1 | P3 | 2022-01-03 | 300.00 |
1 | P4 | 2022-01-04 | 400.00 |
1 | P5 | 2022-01-05 | 500.00 |
And suppose you want to distribute this table across two nodes in the system based on the CustomerID
column. You can create the table with hash distribution as follows:
CREATE TABLE FactOrder ( BrandID INT, ProductID INT, OrderDate DATE, Total DECIMAL(10,2) ) WITH ( DISTRIBUTION = HASH(ProductID) );
By using the HASH(CustomerID) distribution option, data in the FactOrder
table will be distributed across all nodes in the system based on the value of the ProductID
column.
In summary, hash distribution is a useful distribution type in Azure Synapse Analytics that allows you to distribute data across multiple nodes based on the value of a specified column. It is ideal for tables with a column with a high degree of uniqueness, as it groups data together based on the value of the specified column.
Multi Column Hash Distribution
Hash distribution allows you to group data together based on the value of the specified column, and distribute it across multiple nodes in the system.
However, in many cases, a single column may not be sufficient to distribute data effectively. For example, if you have a large table with many rows and multiple columns, distributing the data based on a single column may result in uneven data distribution and poor query performance.
To address this issue, Azure Synapse Analytics now supports multi-column hash distribution. With multi-column hash distribution, you can specify multiple columns to use for distribution, and the system will calculate a hash value based on the combined values of these columns.
For example, suppose you have a table called FactOrder
with the following columns:
BrandID | ProductID | OrderDate | Total |
---|---|---|---|
1 | P1 | 2022-01-01 | 100.00 |
2 | P2 | 2022-01-02 | 200.00 |
3 | P3 | 2022-01-03 | 300.00 |
4 | P4 | 2022-01-04 | 400.00 |
5 | P5 | 2022-01-05 | 500.00 |
6 | P6 | 2022-01-06 | 600.00 |
Suppose you want to distribute this table across two nodes in the system based on the BrandID
and ProductID
columns. You can create the table with multi-column hash distribution as follows:
CREATE TABLE FactOrder ( BrandID INT, ProductID INT, OrderDate DATE, Total DECIMAL(10,2) ) WITH ( DISTRIBUTION = HASH(BrandID, ProductID) );
By using the HASH(BrandID, ProductID) distribution option, data in the FactOrder
table will be distributed across all nodes in the system based on the combined values of the BrandID
and ProductID
columns. This can result in more even data distribution and better query performance.
To summarize, multi-column hash distribution is a powerful feature in Azure Synapse Analytics that allows you to distribute data across multiple nodes based on the combined values of multiple columns. This can result in more even data distribution and better query performance, particularly for large tables with many columns.
To learn more about distributions and other features of Azure Synapse Analytics, check out the official documentation.
Add comment