Convert Word document to PPT document with WPS

[Introduction] First install WPS, and then run the code.

[Example screenshot]

[Core code]

wps2pdf

├── Program.cs
├── Properties
│ └── AssemblyInfo.cs
├── Wps2Pdf.cs
├── wps2pdf.csproj
└── wps2pdf.sln

1 directory, 5 files

File: n459.com/file/25127180-478966306

The following are irrelevant:

---------------------------------------- Dividing line ----------------------------------------

Spark SQL tables are named db_name.table_name, that is, a database name plus a table name. If table_name is referenced directly without specifying db_name, it refers to a table in the default database. In Spark SQL, a database only determines where a table's files are stored, and each table can store its data in a different file format. From this point of view, a database can be regarded as the parent directory of Databricks tables, used to organize tables and their files.

In a Python notebook environment, you can use the %sql magic command to switch a cell to SQL mode:

%sql
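For example, a minimal sketch of a SQL cell in a Databricks Python notebook (sales_db.orders is a hypothetical table used only for illustration):

%sql
-- Query a table qualified with its database name
SELECT * FROM sales_db.orders LIMIT 10;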
One, Database
Commonly used database commands: switch the current database, and list databases, tables, views, and column information:

use db_name
show databases
show tables [in db_name]
show views [in db_name]
show columns in db_name.table_name
1. Create a database

Create a database, specifying where the database files are stored via LOCATION:

CREATE {DATABASE | SCHEMA} [IF NOT EXISTS] database_name
[LOCATION database_directory]
LOCATION database_directory: specifies the file system path where the database is stored. If the path does not exist in the underlying file system, the directory must be created first. If LOCATION is not specified, the database is created in the default warehouse directory, which is set by the static configuration parameter spark.sql.warehouse.dir.
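For example, a minimal sketch (sales_db and the storage path are hypothetical placeholders):

-- Create a database at an explicit storage location
CREATE DATABASE IF NOT EXISTS sales_db
LOCATION '/mnt/data/sales_db.db';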

2. View the description of the database

{DESC | DESCRIBE} DATABASE [EXTENDED] db_name
The EXTENDED option additionally shows the database's extended properties.
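For example, using the hypothetical sales_db from the sketch above:

DESCRIBE DATABASE EXTENDED sales_db;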

3. Delete the database

DROP {DATABASE | SCHEMA} [IF EXISTS] dbname [RESTRICT | CASCADE]
IF EXISTS: if the database does not exist, the DROP operation does not raise an exception.
RESTRICT: a non-empty database cannot be dropped; this is the default.
CASCADE: drop the database together with all of its associated tables and functions.
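For example, dropping the hypothetical sales_db together with any tables and functions it contains:

DROP DATABASE IF EXISTS sales_db CASCADE;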

Two, Create a table
A table has two scopes: global and local. A global table can be referenced from all clusters, while a local table can only be referenced within the cluster that created it; local tables are called temporary views. Tables can be populated from files in DBFS or from data stored in any supported data source.

When creating a table, you specify the file format used to store the table data and the location where the data files are stored.

1. Use a data source to create a table (standard CREATE TABLE command)

The syntax for creating a table is shown below. Note: if a table with the same name already exists in the database, an exception is thrown.

CREATE TABLE [ IF NOT EXISTS ] [db_name].table_name
[ ( col_name1 col_type1, … ) ]
USING data_source
[ OPTIONS ( key1=val1, key2=val2, … ) ]
[ PARTITIONED BY ( col_name1, col_name2, … ) ]
[ CLUSTERED BY ( col_name3, col_name4, … )
[ SORTED BY ( col_name [ ASC | DESC ], … ) ]
INTO num_buckets BUCKETS ]
[ LOCATION path ]
[ AS select_statement ]
Parameter notes:

IF NOT EXISTS: if a table with the same name already exists in the database, nothing happens.
USING data_source: the file format used by the table. data_source must be one of TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, or LIBSVM, or the fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister. HIVE is supported for creating Hive SerDe tables; in that case the Hive-specific file_format and row_format can be specified in the OPTIONS clause, which is a case-insensitive string map. The option keys are FILEFORMAT, INPUTFORMAT, OUTPUTFORMAT, SERDE, FIELDDELIM, ESCAPEDELIM, MAPKEYDELIM, and LINEDELIM.
OPTIONS: options used to tune the behavior of the table or to configure HIVE table options.
PARTITIONED BY (col_name1, col_name2, …): partition the table by the specified columns, creating one directory per partition.
CLUSTERED BY (col_name3, col_name4, …): divide each partition (or the table) into a fixed number of buckets based on the specified columns. This option is usually used together with partitioning. Files in delta format do not support this clause.
SORTED BY: the sort order of data within each bucket; the default is ascending (ASC).
INTO num_buckets BUCKETS: bucketing is an optimization technique that uses buckets (and bucketing columns) to determine how data is partitioned, avoiding data shuffles and keeping the data ordered.
LOCATION path: the directory used to store the table data; a path on distributed storage can be specified.
AS select_statement: populate the table with the output data of a SELECT statement.
2. Use Delta Lake to create a table

You can create a table stored in Delta Lake with the standard CREATE TABLE command. In addition to the standard command, the following syntax can be used to create a Delta table:

CREATE [OR REPLACE] TABLE table_identifier [ ( col_name1 col_type1 [NOT NULL], … ) ]
USING DELTA
[ LOCATION path ]
table_identifier has two formats:

[database_name.] table_name: the name of the table
delta.<path-to-delta-table>: create the table at the specified path without creating an entry in the metastore.
LOCATION: if the specified location already contains data stored in Delta Lake, Delta Lake does the following:

If only the table name and location are specified, for example:

CREATE TABLE events
USING DELTA
LOCATION '/mnt/delta/events'
then the table in the Hive metastore automatically inherits the schema, partitioning, and table properties of the existing data. This can be used to "import" existing data into the metastore.

If you specify any configuration (schema, partitioning, or table properties), Delta Lake verifies that it exactly matches the configuration of the existing data; if it does not, Delta Lake raises an exception describing the difference.
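For example, a minimal sketch that declares an explicit schema for a Delta table at the existing location used above (the column names are assumptions for illustration; Delta Lake raises an exception if they do not match the data already stored at that path):

-- Schema below is hypothetical and must match the existing data
CREATE OR REPLACE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING)
USING DELTA
LOCATION '/mnt/delta/events';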

3. Example of creating a table

-- Use a data source
CREATE TABLE student (id INT, name STRING, age INT) USING PARQUET;

-- Use data from another table
CREATE TABLE student_copy USING PARQUET
AS SELECT * FROM student;

-- Omit the USING clause to use the default data source (parquet by default)
CREATE TABLE student (id INT, name STRING, age INT);

-- Create a partitioned and bucketed table
CREATE TABLE student (id INT, name STRING, age INT)
USING PARQUET
PARTITIONED BY (age)
CLUSTERED BY (id) INTO 4 BUCKETS;
Three, Interacting with data source tables
A data source table acts like a pointer to the underlying data source. For example, you can use the JDBC data source to create a table foo in Azure Databricks that points to a table bar in MySQL; reading and writing table foo actually reads and writes table bar.

Normally, CREATE TABLE creates a "pointer", and you must ensure that the object it points to exists. The exception is file-based sources such as Parquet and JSON; if you do not specify the LOCATION option, Azure Databricks creates a default table location.

For CREATE TABLE AS SELECT, Azure Databricks uses the output data of the select query to overwrite the underlying data source to ensure that the created table contains exactly the same data as the input query.
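For example, a minimal sketch of a JDBC-backed table foo pointing to a MySQL table bar (the connection URL and credentials are placeholders):

-- Connection details below are placeholders
CREATE TABLE foo
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://<host>:3306/<database>',
  dbtable 'bar',
  user '<user>',
  password '<password>'
);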

Four, Insert data
You can insert data into a table or into files in formats supported by Spark.

1. Insert data into the table

The INSERT INTO command appends data to a table without affecting its existing data; the INSERT OVERWRITE command overwrites the existing data in the table.

INSERT INTO [ TABLE ] table_identifier [ partition_spec ]
{ VALUES ( { value | NULL } [ , … ] ) [ , ( … ) ] | query }

INSERT OVERWRITE [ TABLE ] table_identifier [ partition_spec [ IF NOT EXISTS ] ]
{ VALUES ( { value | NULL } [ , … ] ) [ , ( … ) ] | query }
Parameter notes:

table_identifier: either [database_name.]table_name, a table name optionally qualified with a database name, or delta.<path to table>, the location of an existing Delta table.
partition_spec: an optional parameter that specifies a comma-separated list of key/value pairs for partitions. Syntax: PARTITION (partition_col_name = partition_col_val [ , … ])
VALUES ( { value | NULL } [ , … ] ) [ , ( … ) ]: the values to insert, either explicitly specified values or NULL, separated by commas. Multiple value sets can be specified to insert multiple rows.
query: a query that produces the rows to insert. Supported forms: a SELECT statement, a TABLE statement, or a FROM statement.
For example, after creating a table, you can insert a few rows with the VALUES clause, or insert data in bulk with SELECT, TABLE, or FROM, as in the examples below.

CREATE TABLE students (name VARCHAR(64), address VARCHAR(64), student_id INT)
USING PARQUET PARTITIONED BY (student_id);

INSERT INTO students VALUES
('Amy Smith', '123 Park Ave, San Jose', 111111);

INSERT INTO students VALUES
('Bob Brown', '456 Taylor St, Cupertino', 222222),
('Cathy Johnson', '789 Race Ave, Palo Alto', 333333);

INSERT INTO students PARTITION (student_id = 444444)
SELECT name, address FROM persons WHERE name = "Dora Williams";

INSERT INTO students TABLE visiting_students;

INSERT INTO students
FROM applicants SELECT name, address, id WHERE qualified = true;
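By contrast, INSERT OVERWRITE replaces the existing data; a minimal sketch against the same students table (the row values are purely illustrative):

-- Replaces existing rows rather than appending
INSERT OVERWRITE students VALUES
('Ashua Hill', '456 Erica Ct, Cupertino', 111111);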
2. Insert data into files

When inserting data into files, existing data can only be overwritten with new data:

INSERT OVERWRITE [ LOCAL ] DIRECTORY [ directory_path ]
USING file_format [ OPTIONS ( key = val [ , … ] ) ]
{ VALUES ( { value | NULL } [ , … ] ) [ , ( … ) ] | query }
Parameter notes:

directory_path: the destination directory; it can also be specified via path in OPTIONS. The LOCAL keyword specifies that the directory is on the local file system.
file_format: the file format to use for the insert. Valid options include TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, and LIBSVM, or the fully qualified class name of a custom implementation of org.apache.spark.sql.execution.datasources.FileFormat.
OPTIONS ( key = val [ , … ] ): one or more options for writing in the chosen file format.
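For example, a minimal sketch that writes the students table out as Parquet files (the output path is a placeholder):

INSERT OVERWRITE DIRECTORY '/tmp/students_out'
USING PARQUET
SELECT * FROM students;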
