Amazon Redshift (1) Introduction

Redshift supports user-defined functions (UDFs) written in Python, which can be called from SQL.
Normal SQL
select regexp_replace(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;

===>
Python and SQL
create function f_hostname(url VARCHAR) returns varchar immutable as
$$
  import urlparse
  return urlparse.urlparse(url).hostname
$$
LANGUAGE plpythonu;

select f_hostname(url) FROM table;
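
To try the UDF body outside Redshift, the same logic can be run as ordinary Python. The sketch below is a local stand-in, not the Redshift function itself: it assumes Python 3, where the module is urllib.parse (the UDF above uses the Python 2 name urlparse), and the sample URLs and the None guard are additions made for the example.

```python
from urllib.parse import urlparse


def f_hostname(url):
    """Local stand-in for the UDF body: return the host part of a URL."""
    if url is None:
        # Guard added for local testing so a missing value passes through.
        return None
    return urlparse(url).hostname


if __name__ == "__main__":
    # Made-up sample URLs, one with user info and a port, one without.
    for sample in ("https://user@example.com:8443/path?q=1",
                   "http://docs.example.org/"):
        print(sample, "->", f_hostname(sample))
```

Compared with the regular expression version, the parser handles the optional user info and port without any capture groups.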

NumPy and SciPy: numerical and scientific computing
Pandas: SQL-like data operations built on top of NumPy and SciPy
Dateutil and Pytz: date and time zone handling

http://www.numpy.org/

http://scipy.org/about.html

http://pandas.pydata.org/

https://dateutil.readthedocs.org/en/latest/

https://pypi.python.org/pypi/pytz/

Data Warehouse System Architecture
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_high_level_system_architecture.html

Client applications connect through industry-standard PostgreSQL JDBC and ODBC drivers.

Leader node -> compiles the code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node

Compute nodes -> a cluster can start with a single 160 GB node

Load data from S3 into Redshift
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Loading-data-from-S3.html

Copy Command to Load the Data
copy <table_name> from 's3://<bucket_name>/<object_prefix>'
credentials '<aws-auth-args>';
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_loading-tables-from-s3.html

http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-copydata-redshift.html
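
When the load is scripted, the COPY template above can be filled in programmatically. The helper below is a minimal sketch that only builds the statement text; build_copy_statement, the table, bucket, prefix, and role ARN are all hypothetical names chosen for the example, and running the statement against a cluster is left to whatever SQL client is in use.

```python
def build_copy_statement(table_name, bucket_name, object_prefix, aws_auth_args):
    """Fill in the placeholders of the COPY template shown above."""
    values = (table_name, bucket_name, object_prefix, aws_auth_args)
    for value in values:
        # Plain string formatting does no escaping, so reject quotes outright.
        if "'" in value:
            raise ValueError("single quotes are not allowed: %r" % value)
    s3_path = "s3://{}/{}".format(bucket_name, object_prefix)
    return "copy {} from '{}'\ncredentials '{}';".format(
        table_name, s3_path, aws_auth_args)


if __name__ == "__main__":
    # Hypothetical values; replace them with real ones before use.
    print(build_copy_statement(
        "sales",
        "my-bucket",
        "data/sales_",
        "aws_iam_role=arn:aws:iam::123456789012:role/ExampleRole",
    ))
```

Because the object prefix matches every S3 key that starts with it, a single statement can load many files.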

Work on the DB
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_deleting_redshift_user_cmd.html

How to Design the Table
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_designing-tables-best-practices.html

http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Creating_tables.html

How to Load Data
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_loading-data-best-practices.html

How to Query Data
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_designing-queries-best-practices.html

Database Admin Commands
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_querying_redshift_system_tables.html

Table Design
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - timestamp

If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. - range or equality

If you frequently join a table, specify the join column as both the sort key and the distribution key.
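
The reasoning behind the join rule is co-location: when two tables are distributed on the join column, rows with equal keys land on the same node, so the join needs no data movement between nodes. The toy model below illustrates the idea only; the CRC32 hash, the node count, and the table contents are assumptions made for the example, not how Redshift places rows internally.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # assumed cluster size for the illustration


def node_for(key):
    """Map a distribution-key value to a node with a stable toy hash."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_NODES


def distribute(rows, key_index):
    """Group rows by the node that their distribution key hashes to."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(row[key_index])].append(row)
    return placement


def local_join(left, right, left_key, right_key):
    """Join node by node; equal keys always hash to the same node."""
    left_nodes = distribute(left, left_key)
    right_nodes = distribute(right, right_key)
    joined = []
    for node, left_rows in left_nodes.items():
        for lrow in left_rows:
            for rrow in right_nodes.get(node, []):
                if lrow[left_key] == rrow[right_key]:
                    joined.append(lrow + rrow)
    return joined


# Made-up tables: users(user_id, name) and orders(order_id, user_id).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(101, 1), (102, 3), (103, 1)]
result = local_join(users, orders, 0, 1)

if __name__ == "__main__":
    for row in sorted(result):
        print(row)
```

Each node only compares its own rows, yet the combined result equals a full join, which is the effect that distributing both tables on the join column is meant to achieve.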

References:
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_redshift_system_overview.html

https://aws.amazon.com/cn/documentation/redshift/

Reposted from sillycat.iteye.com/blog/2293025