Amazon Redshift (1) Introduction
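Redshift lets you write scalar user-defined functions (UDFs) in Python and call them from SQL, which can replace hard-to-read SQL expressions.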
Normal SQL
select regexp_replace(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
===>
Python and SQL
create function f_hostname(url VARCHAR) returns varchar IMMUTABLE as
$$
  import urlparse
  return urlparse.urlparse(url).hostname
$$
LANGUAGE plpythonu;
select f_hostname(url) FROM table;
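To see what the UDF body computes, the same logic can be tried locally, outside Redshift. This is a minimal sketch, assuming Python 3, where URL parsing lives in `urllib.parse`; the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of the URL parser


def f_hostname(url):
    """Return the host part of a URL, the same value the UDF is meant to return."""
    return urlparse(url).hostname


if __name__ == "__main__":
    # Sample URLs are illustrative only.
    samples = (
        "https://user@example.com:8443/path?q=1",
        "http://docs.example.org/index.html",
    )
    for url in samples:
        print(url, "->", f_hostname(url))
```

The parser handles the optional user part and port that the regular expression has to spell out by hand.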
Libraries available inside Python UDFs:
NumPy and SciPy: mathematical tools
Pandas: high-level, SQL-like data manipulation built on top of NumPy
Dateutil and Pytz: date and time zone handling
http://www.numpy.org/
http://scipy.org/about.html
http://pandas.pydata.org/
https://dateutil.readthedocs.org/en/latest/
https://pypi.python.org/pypi/pytz/
Data Warehouse System Architecture
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_high_level_system_architecture.html
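Client applications connect through industry-standard PostgreSQL JDBC and ODBC drivers.
Leader node -> compiles the code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node
Compute nodes -> execute the compiled code on their portion of the data; a cluster can start with a single 160 GB node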
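The division of labor between the leader node and the compute nodes can be illustrated with a toy model. This is only a sketch under stated assumptions: the `assign_rows` and `count_per_node` helpers, the CRC32 hash, and the three-node cluster are all hypothetical and do not reflect Redshift's actual distribution algorithm.

```python
import zlib
from collections import defaultdict


def assign_rows(rows, key, node_count):
    """Toy 'leader node': assign each row to a compute node by hashing its key column.

    Assumption: CRC32 modulo node_count stands in for the real distribution logic.
    """
    nodes = defaultdict(list)
    for row in rows:
        node_id = zlib.crc32(str(row[key]).encode("utf-8")) % node_count
        nodes[node_id].append(row)
    return dict(nodes)


def count_per_node(nodes):
    """Toy 'compute nodes': each node works only on its own portion of the data."""
    return {node_id: len(portion) for node_id, portion in nodes.items()}


if __name__ == "__main__":
    rows = [{"user_id": i, "url": "http://example.com/%d" % i} for i in range(10)]
    portions = assign_rows(rows, "user_id", node_count=3)
    partial = count_per_node(portions)
    # The leader combines the partial results from the nodes into the final answer.
    print(partial, "total =", sum(partial.values()))
```

Rows with the same key value always land on the same node, which is why the choice of distribution column matters for joins.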
Load data from S3 into Redshift
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Loading-data-from-S3.html
Copy Command to Load the Data
copy <table_name> from 's3://<bucket_name>/<object_prefix>'
credentials '<aws-auth-args>';
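When loads are scripted, the COPY statement is often assembled from parameters. A minimal sketch, assuming a hypothetical `build_copy_statement` helper; the table, bucket, and prefix values in the usage lines are placeholders, and the credentials argument is left as the `<aws-auth-args>` placeholder from the template rather than filled in.

```python
import re


def build_copy_statement(table_name, bucket_name, object_prefix, aws_auth_args):
    """Fill in the template:
    copy <table_name> from 's3://<bucket_name>/<object_prefix>' credentials '<aws-auth-args>';
    """
    # Accept only plain (optionally schema-qualified) identifiers for the table name.
    if not re.match(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$", table_name):
        raise ValueError("invalid table name: %r" % table_name)
    # Reject single quotes so a value cannot terminate the SQL string literal early.
    for label, value in (
        ("bucket_name", bucket_name),
        ("object_prefix", object_prefix),
        ("aws_auth_args", aws_auth_args),
    ):
        if "'" in value:
            raise ValueError("%s must not contain a single quote" % label)
    return "copy %s from 's3://%s/%s'\ncredentials '%s';" % (
        table_name, bucket_name, object_prefix, aws_auth_args)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_copy_statement("events", "my-bucket", "logs/2016/", "<aws-auth-args>"))
```

The object prefix means every S3 object whose key starts with that prefix is loaded, so one statement can cover many files.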
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_loading-tables-from-s3.html
http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-copydata-redshift.html
Work on the DB
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_deleting_redshift_user_cmd.html
How to Design the Table
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_designing-tables-best-practices.html
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Creating_tables.html
How to Load Data
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_loading-data-best-practices.html
How to Query Data
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_designing-queries-best-practices.html
Database Admin Commands
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_querying_redshift_system_tables.html
Table Design
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - timestamp
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. - range or equality
If you frequently join a table, specify the join column as both the sort key and the distribution key.
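These rules map onto the SORTKEY and DISTKEY table attributes. A minimal sketch that generates the DDL for a table; the `build_create_table` helper and the `events` and `orders` tables with their columns are hypothetical, chosen only to illustrate the rules.

```python
def build_create_table(table_name, columns, sort_keys, dist_key=None):
    """Build a CREATE TABLE statement with DISTKEY/SORTKEY table attributes.

    columns: list of (name, type) pairs.
    sort_keys: column names in sort-key order, leading column first.
    dist_key: optional distribution column.
    """
    names = [name for name, _ in columns]
    for col in list(sort_keys) + ([dist_key] if dist_key else []):
        if col not in names:
            raise ValueError("unknown column: %s" % col)
    column_sql = ",\n".join("  %s %s" % (name, col_type) for name, col_type in columns)
    ddl = "create table %s (\n%s\n)" % (table_name, column_sql)
    if dist_key:
        ddl += "\ndistkey(%s)" % dist_key
    if sort_keys:
        ddl += "\nsortkey(%s)" % ", ".join(sort_keys)
    return ddl + ";"


if __name__ == "__main__":
    # Recent data queried most often: timestamp as the leading sort key column.
    print(build_create_table(
        "events",
        [("event_time", "timestamp"), ("url", "varchar(2048)")],
        sort_keys=["event_time"]))
    # Frequently joined table: join column as both sort key and distribution key.
    print(build_create_table(
        "orders",
        [("customer_id", "integer"), ("created_at", "timestamp")],
        sort_keys=["customer_id"],
        dist_key="customer_id"))
```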
References:
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_redshift_system_overview.html
https://aws.amazon.com/cn/documentation/redshift/
Reposted from sillycat.iteye.com/blog/2293025