Kotlin, Scala, and SPL: Which is the most efficient JVM data processing language?

The main open-source data processing languages based on the JVM are Kotlin, Scala, and SPL. This article compares the three head to head to find the most efficient data processing language. The scope is limited to the common data processing and business logic of application development, chiefly over structured data; big data and high performance are not the focus, and special scenarios such as message streams and scientific computing are not covered.

Basic Features

Application scope

Kotlin was designed to be a more efficient Java and can be applied in any scenario where Java is used; besides common information management systems, it also suits WebServer, Android projects, and game development, giving it good versatility. Scala was designed as a general-purpose language integrating modern programming paradigms; in practice it is mainly used for back-end big data processing, rarely appears in other kinds of projects, and is less versatile than Kotlin. SPL was designed as a professional data processing language, and practice matches that intent: it suits both front-end and back-end data processing, over data large and small. Its application scenarios are deliberately focused, so its versatility is also lower than Kotlin's.

Programming paradigms

Kotlin focuses on object-oriented programming but also supports functional programming. Scala supports both paradigms: its object orientation is more thorough than Kotlin's, and its functional programming is more convenient than Kotlin's. SPL can hardly be said to support object-oriented programming: it has the concept of objects but no inheritance, overloading, or the like; its functional programming, however, is also more convenient than Kotlin's.

Execution mode

Kotlin and Scala are compiled languages; SPL is an interpreted language. An interpreted language is more flexible, but the same code runs somewhat slower. SPL, however, has rich and efficient library functions, so its overall performance is not weak, and it often has the advantage on big data.

External class libraries

Kotlin can use all Java class libraries, but it lacks professional data processing libraries. Scala can also use all Java class libraries and, in addition, has a built-in professional big data processing library (Spark). SPL has built-in professional data processing functions that provide a large number of basic operations with lower time complexity; external Java class libraries are usually unnecessary, and in special cases they can be called from custom functions.

IDE and debugging

All three have graphical IDEs and complete debugging functionality. SPL's IDE is designed specifically for data processing: structured data objects are presented as tables, which makes them convenient to observe. The IDEs of Kotlin and Scala are general-purpose and not optimized for data processing, so structured data objects are harder to observe.

Learning difficulty

Kotlin is slightly harder to learn than Java; anyone proficient in Java can master it easily. Scala's goal is to surpass Java, and it is far harder to learn than Java. The goal of SPL is to simplify the coding of Java and even SQL; many concepts are deliberately simplified, so the learning difficulty is very low.

Amount of code

Kotlin's original intention is to improve Java's development efficiency. Officially, the overall code volume is claimed to be only 20% of Java's, but perhaps because its data processing class libraries are not professional enough, the actual reduction for this kind of code is not large. Scala has plenty of syntactic sugar and its big data processing class library is more professional, so its code volume is much lower than Kotlin's. SPL is used only for data processing and is the most professional; together with the strong expressiveness of an interpreted language, the amount of code for the same task is far lower than the previous two (comparison examples follow later), which also helps explain its low learning difficulty.

Syntax

Data types

Atomic data types: supported by all three, e.g. Short, Int, Long, Float, Double, Boolean.

Date and time types: Kotlin lacks easy-to-use date and time types of its own and usually uses Java's (a sketch follows this list). Both Scala and SPL have specialized and convenient datetime types.

Distinctive data types: Kotlin supports the non-numeric character type Char and nullable types such as Any?. Scala supports tuples (fixed-length generic collections) and a built-in BigDecimal. SPL supports high-performance multi-layer ordinal keys and a built-in BigDecimal.

Collection types: Kotlin and Scala support Set, List, and Map. SPL supports the sequence (an ordered generic collection, similar to List).

Structured data types: Kotlin has record collections in the form List<EntityBean>, but lacks metadata and is not professional enough. Scala has professional structured data types, including Row, RDD, DataSet, and DataFrame (used as the example in this article). SPL has professional structured data types, including the record, the sequence table (used as the example in this article), in-memory compressed tables, external-storage lazy cursors, and so on.
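
To illustrate the Kotlin point above, a minimal sketch borrowing the JDK's java.time classes, which is the usual Kotlin approach (the values here are illustrative):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Kotlin has no date/time types of its own; the JDK's java.time API is used instead
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val d = LocalDate.parse("2022-01-31", fmt)
val later = d.plusDays(30)		// date arithmetic comes from java.time, not Kotlin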

Scala has a unique implicit conversion capability, which can in theory convert between any data types (including parameters, variables, functions, and classes) and can easily change or enhance existing functionality.

Flow control

All three support basic sequential execution, conditional branching, and loops; in theory they can handle arbitrarily complex flow control, so this aspect is not discussed further. The following focuses on comparing how convenient the loop structures are for collection data. Taking the calculation of the period-over-period growth rate as an example, the Kotlin code:

mData.forEachIndexed { index, it ->
    if (index > 0) it.Mom = it.Amount / mData[index - 1].Amount - 1
}

Kotlin's forEachIndexed function carries both an ordinal variable and a member variable, which makes looping over a collection convenient; it supports indexed access to records, so cross-row calculations are easy. Kotlin's downside is the extra handling needed to avoid index out-of-bounds.

Scala code:

val w = Window.orderBy(mData("SellerId"))
mData.withColumn("Mom", mData("Amount")/lag(mData("Amount"),1).over(w)-1)

Scala's cross-row calculation does not have to guard against going out of bounds, which is more convenient than Kotlin. However, Scala's structured data objects do not support access by index, so the calculation can only be expressed by shifting the whole column with the lag function, which is less convenient for structured data. Moreover, lag cannot be used with the general-purpose foreach; it has to be paired with a dedicated loop function such as withColumn. And to keep the underlying functional style unified with SQL semantics, lag must also be combined with a window function (pandas' shift in Python, by contrast, has no such requirement), so overall the code looks more complicated than Kotlin's.

SPL code:

mData.(Mom=Amount/Amount[-1]-1)

SPL has made a number of optimizations to flow control over structured data objects. The most common loop, the equivalent of forEach, can be expressed in SPL directly with parentheses, simplified to the extreme. SPL also has a shift function, but the more intuitive "[relative position]" syntax is used here, which is more powerful than Kotlin's absolute positioning for cross-row calculation and more convenient than Scala's shift function. Beyond the code above, SPL has more flow-control functionality for structured data, for example: taking a batch of records, rather than one, in each round of the loop; or starting a new round when a field value changes.

Lambda expressions

Lambda expressions are lightweight implementations of anonymous functions whose purpose is to simplify function definitions, especially for the diverse set-calculation functions. Kotlin supports Lambda expressions, but as a compiled language it cannot easily infer whether a parameter expression is a value parameter or a function parameter; it can only design complex interface rules to distinguish them, and there are even so-called dedicated interfaces for higher-order functions. This makes Kotlin's Lambda expressions hard to write and unprofessional for data processing. A few examples:

"abcd".substring( 1,2)						//值参数
"abcd".sumBy{ it.toInt()}					//函数参数
mData.forEachIndexed{ index,it-> if(index>0) it.Mom=…}		//函数参数的函数带多个参数

Kotlin's Lambda expressions are not professional enough, which also shows in the fact that a field must be qualified with the structured data object's variable name (it); the table name cannot be omitted when computing over a single table, as it can in SQL.

As a compiled language, Scala's Lambda expressions are not much different from Kotlin's: complex interface rules also have to be designed and they are just as hard to write, so no example is given here. When calculating the period-over-period ratio, the structured data object's variable name must likewise be put before the field, or the col function must be used, as in mData("Amount") or col("Amount"). Syntactic sugar allows writing $"Amount" or 'Amount instead, but many functions do not support these forms, and insisting on them makes the style inconsistent.

SPL's Lambda expressions are simple and easy to use, more professional than the previous two, which is related to SPL being an interpreted language. An interpreted language can conveniently infer value parameters versus function parameters; there are no so-called dedicated interfaces for higher-order functions, and all function interfaces are equally simple. A few examples:

mid("abcd",2,1)							//值参数
Orders.sum(Amount*Amount)					//函数参数
mData.(Mom=Amount/Amount[-1]-1)					//函数参数的函数带多个参数

SPL can use field names directly, without the structured data object's variable name, for example:

Orders.select(Amount>1000 && Amount<=3000 && like(Client,"*S*"))

Most SPL loop functions have a default member variable ~ and an ordinal variable #, which significantly improve the convenience of coding, especially for structured data calculations. For example, to retrieve the records at even positions:

Students.select(# % 2==0)

Find the top 3 in each group:

Orders.group(SellerId;~.top(3;Amount))

SPL function options and hierarchical parameters

It is worth mentioning that, in order to further improve development efficiency, SPL also provides a unique function syntax.

When many functions have similar functionality, most programming languages can only distinguish them with different names or parameters, which is inconvenient to use. SPL provides very distinctive function options, allowing functions with similar functionality to share one name and using options to express the differences. For example, the basic function of select is filtering; to take only the first record that meets the condition, use the option @1:

T.select@1(Amount>1000)

For fast filtering of ordered data by binary search, use @b:

T.select@b(Amount>1000)

Function options can also be combined, for example:

Orders.select@1b(Amount>1000)

Some functions have complex parameters that may be divided into multiple layers. Conventional programming languages have no special syntax for this and can only construct multi-layer structured data objects and pass them in, which is very troublesome. SQL uses keywords to separate parameters into groups, which is more intuitive and simple, but consumes many keywords and makes the statement structure inconsistent. SPL invented hierarchical parameters to simplify the expression of complex parameters, dividing parameters into three layers from high to low with semicolons, commas, and colons:

join(Orders:o,SellerId ; Employees:e,EId)

Data sources

Types of data sources

In principle, Kotlin can support all Java data sources, but the code is cumbersome, type conversion is troublesome, and stability is poor. This is because Kotlin has no built-in data source access interface, let alone one optimized for structured data processing (the JDBC interface excepted). In this sense, it can also be said to directly support no data source at all, only Java third-party class libraries; fortunately, the number of third-party class libraries is large enough.
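
For example, a hedged sketch of reading a table over bare JDBC in Kotlin (the connection URL, table, and column names are illustrative assumptions):

import java.sql.DriverManager

data class OrderRow(val orderId: Int, val amount: Double)

// read a table through the plain JDBC interface, converting types by hand
fun readOrders(): List<OrderRow> =
    DriverManager.getConnection("jdbc:h2:~/test", "sa", "").use { conn ->
        conn.createStatement().use { st ->
            val rs = st.executeQuery("select OrderID, Amount from Orders")
            val rows = mutableListOf<OrderRow>()
            while (rs.next()) rows.add(OrderRow(rs.getInt(1), rs.getDouble(2)))
            rows
        }
    }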

Scala supports many types of data sources, with six interfaces built in and optimized for structured data processing: JDBC, CSV, TXT, JSON, the Parquet columnar format, and the ORC columnar format. Other data source interfaces are not built in, but third-party class libraries developed by the community can be used. Scala provides a data source interface specification that requires third-party class libraries to output structured data objects; common third-party interfaces include XML, Cassandra, HBase, and MongoDB.

SPL has the most built-in data source interfaces and is optimized for structured data processing, including:

JDBC (i.e. all RDBs)

CSV, TXT, JSON, XML, Excel

HBase, HDFS, Hive, Spark

Salesforce, Alibaba Cloud

Restful, WebService, Webcrawl

Elasticsearch, MongoDB, Kafka, R2DBC, FTP

Cassandra, DynamoDB, InfluxDB, Redis, SAP

These data sources can all be used directly, which is very convenient. For other data sources not listed, SPL provides an interface specification: as long as data is output as SPL's structured data objects according to the specification, subsequent calculations can proceed.

Code comparison

Take a standard CSV file as an example and compare the parsing code of the three languages. Kotlin:

val file = File("D:\\data\\Orders.txt")
data class Order(var OrderID: Int, var Client: String, var SellerId: Int, var Amount: Double, var OrderDate: Date)
var sdf = SimpleDateFormat("yyyy-MM-dd")
var Orders = file.readLines().drop(1).map {
    var l = it.split("\t")
    Order(l[0].toInt(), l[1], l[2].toInt(), l[3].toDouble(), sdf.parse(l[4]))
}
var result = Orders.filter { it.Amount > 1000 && it.Amount <= 3000 }

Kotlin lacks professionalism and usually has to read a CSV with hard-wired code: define the data structure in advance, then manually parse the data types inside the loop function, which makes the overall code quite cumbersome. Class libraries such as OpenCSV can also be used; then the data types need not be parsed in code, but the structure must still be declared in advance (for example via annotations or a mapping configuration), so the implementation is not necessarily simpler.
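
For reference, a hedged sketch of the OpenCSV route (the OrderBean class and column names are assumptions; note that the structure still has to be declared up front, here via annotations):

import com.opencsv.bean.CsvBindByName
import com.opencsv.bean.CsvToBeanBuilder
import java.io.FileReader

// the target structure must still be declared before reading
class OrderBean {
    @field:CsvBindByName(column = "OrderID") var orderId: Int = 0
    @field:CsvBindByName(column = "Amount") var amount: Double = 0.0
}

val orders: List<OrderBean> =
    CsvToBeanBuilder<OrderBean>(FileReader("D:/data/Orders.txt"))
        .withType(OrderBean::class.java)
        .withSeparator('\t')
        .build()
        .parse()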

Scala is more professional, with a built-in interface for parsing CSV, and the code is much shorter than Kotlin's:

val spark = SparkSession.builder().master("local").getOrCreate()
val Orders = spark.read.option("header", "true").option("sep","\t").option("inferSchema", "true").csv("D:/data/orders.csv").withColumn("OrderDate", col("OrderDate").cast(DateType))
Orders.filter("Amount>1000 and Amount<=3000")

Scala still has some trouble with data types (note the explicit cast of OrderDate above), but otherwise shows no obvious shortcomings.

SPL is more professional still, needing only one line for both parsing and calculation:

T("D:/data/orders.csv").select(Amount>1000 && Amount<=3000)

Cross-source computing

A JVM data processing language is highly open and should have sufficient ability to associate, merge, and aggregate data from different sources.

Kotlin is not professional enough here: it lacks both built-in data source interfaces and cross-source calculation functions, so everything must be hard-coded. Suppose the employee table and the orders table have already been fetched from different data sources; to associate the two:

data class OrderNew(var OrderID: Int, var Client: String, var SellerId: Employee, var Amount: Double, var OrderDate: Date)
val result = Orders.map { o ->
    var emp = Employees.firstOrNull { it.EId == o.SellerId }
    emp?.let { OrderNew(o.OrderID, o.Client, emp, o.Amount, o.OrderDate) }
}
.filter { o -> o != null }

Kotlin's shortcomings are easy to see: once the code grows long, the Lambda expressions become hard to read, less understandable than ordinary code; and the data structure of the association result must be defined in advance, which is inflexible and interrupts the flow of problem solving.

Scala is more professional than Kotlin: it has multiple built-in data source interfaces and also provides functions for cross-source computing. For the same calculation, the Scala code is much simpler:

val join=Orders.join(Employees,Orders("SellerId")===Employees("EId"),"Inner")

It can be seen that Scala not only has objects and functions dedicated to structured data calculation, but also combines them well with Lambda expressions; the code is easier to understand, and there is no need to define the data structure in advance.

SPL is more professional still: its structured data objects are more professional, its cross-source computing functions more convenient, and its code shorter:

join(Orders:o,SellerId;Employees:e,EId)

Own storage format

Intermediate data that is reused is usually saved as a local file in some format to improve retrieval performance. Kotlin supports files in many formats and can in theory store and recompute intermediate data, but because it is not professional at data processing, even basic reading and writing require large amounts of code; in effect, Kotlin has no storage format of its own.
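
As a rough illustration, a sketch of what "rolling your own storage" can look like in Kotlin, using plain JDK object serialization (the Order class here is an assumption):

import java.io.*

data class Order(val orderId: Int, val amount: Double) : Serializable

// write intermediate results to a local file with JDK serialization
fun save(path: String, data: List<Order>) =
    ObjectOutputStream(FileOutputStream(path)).use { it.writeObject(ArrayList(data)) }

// read them back for recalculation
@Suppress("UNCHECKED_CAST")
fun load(path: String): List<Order> =
    ObjectInputStream(FileInputStream(path)).use { it.readObject() as List<Order> }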

Scala supports a variety of storage formats, of which the Parquet file is common and easy to use. Parquet is an open-source storage format that supports columnar storage and can hold large amounts of data; intermediate calculation results (DataFrames) convert to and from Parquet files conveniently. Unfortunately, index support for Parquet is not yet mature.

val df = spark.read.parquet("input.parquet")
val result = df.groupBy(df("Dept"), df("Gender")).agg(sum("Amount"), count("*"))
result.write.parquet("output.parquet")

SPL supports two private binary storage formats, btx and ctx. btx is simple row storage; ctx supports row storage, columnar storage, and indexes, can hold large amounts of data, and supports high-performance computing. Intermediate calculation results (sequence tables/cursors) convert to and from these two file formats conveniently.

A
1 =file("input.ctx").open()
2 =A1.cursor(Dept,Gender,Amount).groups(Dept,Gender;sum(Amount):amt,count(1):cnt)
3 =file("output.ctx").create(#Dept,#Gender,amt,cnt).append(A2.cursor())

Structured Data Computing

Structured data objects

The core of data processing is computation, especially computation over structured data. How professional the structured data objects are profoundly determines how convenient data processing can be.

Kotlin has no professional structured data object; List<EntityBean> is commonly used for structured data calculation, where a data class can simplify the definition of the EntityBean.

List is an ordered collection (duplicates allowed), and Kotlin supports all functions involving member ordinals, i.e. indexes. For example, accessing members by index:

Orders[3]						// record by index, 0-based
Orders.take(3)						// first 3 records
Orders.slice(listOf(1,3,5)+IntRange(7,10))		// records at indexes 1, 3, 5, 7-10

Members can also be taken counting from the end:

Orders.reversed().slice(listOf(0,2,4))			// 1st, 3rd, 5th records from the end
Orders.take(1)+Orders.takeLast(1)			// first record and last record

Order-related calculations are usually difficult, but Kotlin supports ordered collections, so they are relatively convenient here. As a kind of collection, List is good at functions such as adding, deleting, and modifying members, intersection, and splitting. But List is not a professional structured data object: once functions involving the field structure are needed, Kotlin struggles. For example, take two fields of Orders to form a new structured data object:

data class CliAmt(var Client: String, var Amount: Double)
var CliAmts = Orders.map { CliAmt(it.Client, it.Amount) }

This function is very commonly used, equivalent to the simple SQL statement select Client, Amount from Orders, but Kotlin writes it very laboriously: a new structure must be defined in advance, and the field assignment must be hard-coded. If even simple field selection is this cumbersome, more advanced functionality is worse still, for example: taking fields by ordinal, taking fields by parameter, getting the field-name list, modifying the field structure, defining keys and indexes on fields, and querying or computing by field.
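
For instance, even getting the field-name list requires runtime reflection; a hedged sketch (assuming the Order data class above and the kotlin-reflect dependency):

import kotlin.reflect.full.memberProperties

// the closest Kotlin equivalent of "get the field name list"
val fieldNames = Order::class.memberProperties.map { it.name }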

Scala also has List, which differs little from Kotlin's, but Scala has designed more professional data objects for structured data processing: DataFrame (together with RDD and DataSet).
DataFrame is a structured data stream, somewhat similar to a database result set. Being an unordered collection, it does not support access by ordinal and can only simulate it. For example, the 10th record:

Orders.limit(10).tail(1)(0)						

As one can imagine, all order-related calculations are troublesome with DataFrame, such as taking a range of rows, moving averages, or sorting in reverse order.
Besides being unordered, DataFrame does not support modification (it is immutable); to change data or structure, a new DataFrame must be generated. For example, renaming a field is actually done by copying records:

Orders.selectExpr("Client as Cli")					

DataFrame supports common set calculations such as splitting, merging, and intersection, where union is achieved by concatenating the sets and removing duplicates; but because these are implemented by copying records, the performance of set calculations is generally not high.
Despite the many shortcomings, DataFrame is a professional structured data object, and its ability to access fields is beyond Kotlin's reach. For example, getting the metadata/field-name list:

Orders.schema.fields.map(it=>it.name).toList

Fetching data by field is also convenient, for example taking two fields to form a new DataFrame:

Orders.select("Client","Amount")				//可以只用字段名

Or form a new DataFrame with computed columns:

Orders.select(Orders("Client"),Orders("Amount")+1000)		//不能只用字段名	

Unfortunately, DataFrame only supports referencing fields by name, as strings; it supports neither field ordinals nor default field names, which is inconvenient in many scenarios. Moreover, DataFrame cannot define indexes, so high-performance random queries are impossible; its professionalism is still flawed.

SPL's structured data object is the sequence table, which has the advantages of being professional enough, easy to use, and highly expressive.
Accessing members by ordinal:

Orders(3)							// record by ordinal, 1-based
Orders.to(3)							// first 3 records
Orders.m(1,3,5,7:10)						// records at ordinals 1, 3, 5, 7-10

Records can also be taken counting from the end; the distinctive feature is that a negative sign denotes a position from the end, which is more professional and convenient than Kotlin:

Orders.m(-1,-3,-5)						// 1st, 3rd, 5th records from the end
Orders.m(1,-1)							// first record and last record

As a kind of collection, the sequence table also supports functions for adding, deleting, and modifying members, as well as intersection, union, difference, and splitting. Since the sequence table, like List, is a mutable collection, set calculations reuse the original records as much as possible rather than copying them, so performance is much better than Scala's and memory usage is lower.
The sequence table is a professional structured data object: besides collection-related functions, what matters more is that it accesses fields conveniently. For example, getting the field-name list:

Orders.fname()							

Take two fields to form a new sequence table:

Orders.new(Client,Amount)

Form a new sequence table with computed columns:

Orders.new(Client,Amount*0.2)

Modify the field name:

Orders.alter(;OrderDate)					// does not copy records

In some scenarios, fields need to be accessed by field ordinal or by default name; SPL provides corresponding access methods:

Orders(Client)							// by field name (as an expression)
Orders([#2,#3])							// by default field names
Orders.field("Client")						// by string (external parameter)
Orders.field(2)							// by field ordinal

As a professional structured data object, the sequence table also supports defining keys and indexes on fields:

Orders.keys@i(OrderID)						// define a key and build a hash index at the same time
Orders.find(47)							// fast lookup using the index

Calculation functions

Kotlin supports some basic calculation functions, including filtering, sorting, deduplication, set union and intersection, various aggregations, and grouped aggregation. But these functions all target ordinary collections; once the calculation target becomes structured data objects, the function library is far from sufficient and usually has to be supplemented with hard-coded calculations. Many basic operations are not supported at all and can only be coded by hand, including association, window functions, ranking, row-to-column transposition, merge, and binary search. Among them, merge and binary search are order-related operations, and since Kotlin's List is an ordered collection, implementing them by hand is not too difficult. In general, Kotlin's function library can be called weak for structured data computing.
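
As an illustration of such hand-coding, a hedged Kotlin sketch of ranking (a hypothetical helper, not a library function; ties share the rank of the first equal key):

// rank 1 = largest key; O(n^2), but enough to show the hand-coding involved
fun <T, R : Comparable<R>> rank(data: List<T>, key: (T) -> R): List<Int> {
    val sortedKeys = data.map(key).sortedDescending()
    return data.map { sortedKeys.indexOf(key(it)) + 1 }
}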

Scala's calculation functions are relatively rich and are all designed for structured data objects, including functions Kotlin does not support: ranking, association, window functions, and row-to-column transposition. But they basically stay within the framework of SQL. Some basic set operations are still missing, especially order-related ones such as merge and binary search; and because Scala's DataFrame inherits SQL's notion of unordered data, implementing such operations even by hand is very difficult. In general, Scala's function library is richer than Kotlin's, but basic operations are still missing.

SPL has the most abundant calculation functions, all designed for structured data objects. SPL greatly enriches structured data operations and designs much content beyond SQL, which Scala/Kotlin naturally do not support either: ordered calculations such as merge, binary search, taking records by interval, and ordinals of records meeting a condition; besides regular equivalence grouping, enumeration grouping, alignment grouping, and ordered grouping; association types distinguished into foreign-key and primary-sub associations; primary keys to constrain data and indexes for fast queries; recursive queries over multi-level data (multi-table associations or JSON/XML); and so on.

Taking grouping as an example, besides regular equivalence grouping, SPL provides more grouping schemes:

Enumeration grouping: grouping based on several conditional expressions, where records satisfying the same condition form one group.

Alignment grouping: grouping based on an external set, where records whose field value equals a member of the set form that member's group; the group order follows the set's member order, empty groups are allowed, and records belonging to no member of the set can be put into a separate group.

Ordered grouping: grouping based on fields that are already ordered, for example starting a new group whenever a field value changes or whenever some condition holds. SPL provides this kind of ordered grouping directly: it only takes adding an option to the regular grouping function, which is very simple and also performs better. Other languages (including SQL) have no such grouping and can only laboriously convert it into regular equivalence grouping, or hard-code it themselves (as the sketch below suggests).
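
For contrast, a hedged Kotlin sketch of hand-coding ordered grouping (a hypothetical helper; it assumes the data is already sorted by the key):

fun <T, K> groupOrdered(data: List<T>, key: (T) -> K): List<List<T>> {
    val groups = mutableListOf<MutableList<T>>()
    for (e in data) {
        // open a new group whenever the key value changes
        if (groups.isEmpty() || key(groups.last().last()) != key(e))
            groups.add(mutableListOf())
        groups.last().add(e)
    }
    return groups
}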

Let's use a few common examples to get a feel for the differences in the three languages' calculation functions.

Sorting

Sort by Client in ascending order and by Amount in descending order. Kotlin:

Orders.sortedByDescending{it.Amount}.sortedBy{it.Client}

The Kotlin code is not long, but there are still inconveniences: ascending and descending order use two different functions, the field name must carry the table name (it), and the fields are written in the reverse of the actual sort order.

Scala:

Orders.orderBy(Orders("Client"),-Orders("Amount"))

Scala is much simpler: the negative sign expresses descending order, and the fields are written in the same order as the sort. Unfortunately, fields still need the table name; and as a compiled language, Scala can only implement dynamic expression parsing through strings, which makes the code style inconsistent.

SPL:

Orders.sort(Client,-Amount)

The SPL code is simpler still: fields need no table name, and the interpreted language keeps the code style consistent.

Grouping and aggregation

Kotlin:

data class Grp(var Dept: String, var Gender: String)
data class Agg(var sumAmount: Double, var rowCount: Int)
var result1 = data.groupingBy { Grp(it!!.Dept, it.Gender) }
    .fold(Agg(0.0, 0)) { acc, elem -> Agg(acc.sumAmount + elem!!.Amount, acc.rowCount + 1) }
    .toSortedMap(compareBy<Grp> { it.Dept }.thenBy { it.Gender })

The Kotlin code is cumbersome: besides using the groupingBy and fold functions, hard coding is needed to implement the grouping and aggregation. Whenever a new data structure appears, it must be defined in advance before use, such as the two-field grouping structure and the two-field aggregate structure, which is inflexible and interrupts the flow of problem solving. The final sort is only there to keep the result order consistent with the other languages and is not strictly required.

Scala:

val result=data.groupBy(data("Dept"),data("Gender")).agg(sum("Amount"),count("*"))

The Scala code is much simpler: it is easier to understand, and there is no need to define data structures in advance.

SPL:

data.groups(Dept,Gender;sum(Amount),count(1))

The SPL code is the simplest, and its expressiveness is no lower than SQL's.

Association calculations

The two tables have a field with the same name; associate them, then group and aggregate. Kotlin code:

data class OrderNew(var OrderID: Int, var Client: String, var SellerId: Employee, var Amount: Double, var OrderDate: Date)
val result = Orders.map { o ->
    var emp = Employees.firstOrNull { it.EId == o.EId }
    emp?.let { OrderNew(o.OrderID, o.Client, emp, o.Amount, o.OrderDate) }
}
.filter { o -> o != null }
data class Grp(var Dept: String, var Gender: String)
data class Agg(var sumAmount: Double, var rowCount: Int)
var result1 = result.groupingBy { Grp(it!!.SellerId.Dept, it.SellerId.Gender) }
    .fold(Agg(0.0, 0)) { acc, elem -> Agg(acc.sumAmount + elem!!.Amount, acc.rowCount + 1) }
    .toSortedMap(compareBy<Grp> { it.Dept }.thenBy { it.Gender })

The Kotlin code is cumbersome: new data structures must be defined in many places, including the association result, the two-field grouping structure, and the two-field aggregate structure.

Scala:

val join=Orders.as("o").join(Employees.as("e"),Orders("EId")===Employees("EId"),"Inner")
val result= join.groupBy(join("e.Dept"), join("e.Gender")).agg(sum("o.Amount"),count("*"))

Scala is much simpler than Kotlin, with no cumbersome data structure definitions and no hard coding.

SPL is simpler:

join(Orders:o,SellerId;Employees:e,EId).groups(e.Dept,e.Gender;sum(o.Amount),count(1))

Comprehensive data processing comparison

Here the CSV content is not standardized: every three lines form one record, and the second line of each group contains three fields (that is, a collection within the collection). The task is to organize the file into a standardized structured data object and sort it by the 3rd and 4th fields.

Kotlin:

data class Order(var OrderID: Int, var Client: String, var SellerId: Int, var Amount: Double, var OrderDate: Date)
var Orders = ArrayList<Order>()
var sdf = SimpleDateFormat("yyyy-MM-dd")
var raw = File("d:\\threelines.txt").readLines()
raw.forEachIndexed { index, it ->
    if (index % 3 == 0) {
        var f234 = raw[index + 1].split("\t")
        var r = Order(raw[index].toInt(), f234[0], f234[1].toInt(), f234[2].toDouble(),
            sdf.parse(raw[index + 2]))
        Orders.add(r)
    }
}
var result = Orders.sortedByDescending { it.Amount }.sortedBy { it.SellerId }

Kotlin is not very professional at data processing: most of the functionality has to be hard-coded, including taking fields by position and taking fields from the collection within the collection.

Scala:

val raw=spark.read.text("D:/threelines.txt")
val rawrn=raw.withColumn("rn", monotonically_increasing_id())
var f1=rawrn.filter("rn % 3==0").withColumnRenamed("value","OrderId")
var f5=rawrn.filter("rn % 3==2").withColumnRenamed("value","OrderDate")
var f234=rawrn.filter("rn % 3==1")
.withColumn("splited",split(col("value"),"\t"))
.select(col("splited").getItem(0).as("Client")
,col("splited").getItem(1).as("SellerId")
,col("splited").getItem(2).as("Amount"))
f1=f1.withColumn("rn1",monotonically_increasing_id())
f5=f5.withColumn("rn1",monotonically_increasing_id())
f234=f234.withColumn("rn1",monotonically_increasing_id())
var f=f1.join(f234,f1("rn1")===f234("rn1"))
.join(f5,f1("rn1")===f5("rn1"))
.select("OrderId","Client","SellerId","Amount","OrderDate")
val result=f.orderBy(col("SellerId"),-col("Amount"))

Scala is more professional at data processing, making heavy use of structured calculation functions instead of hard-written loop code. But Scala lacks ordered-computation ability, and the related work has to be handled by adding an ordinal column, which makes the overall code rather long.
SPL:

A
1 =file("D:\\data.csv").import@si()
2 =A1.group((#-1)\3)
3 =A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )
4 =A3.sort(SellerId,-Amount)

SPL is the most professional at data processing and achieves the goal with structured calculation functions alone. SPL supports ordered calculation, so it can group by position directly, take fields by position, and take fields from the collection within the collection. Although the approach is similar to Scala's, the code is much shorter.

Application structure

Java application integration

Kotlin compiles to bytecode which, like an ordinary class file, is easily called from Java. For example, the static method fun multiLines(): List<Order> in KotlinFile.kt is correctly recognized by Java and can be called directly:

java.util.List result=KotlinFileKt.multiLines();
result.forEach(e->{System.out.println(e);});

Scala also compiles to bytecode and can likewise be called from Java with ease. For example, the static method def multiLines(): DataFrame of the ScalaObject object is recognized by Java as the Dataset type and can be called with slight modification:

org.apache.spark.sql.Dataset df=ScalaObject.multiLines();
df.show();

SPL provides a common JDBC interface, and simple SPL code can be directly embedded in Java like SQL:

Class.forName("com.esproc.jdbc.InternalDriver");
Connection connection =DriverManager.getConnection("jdbc:esproc:local://");
Statement statement = connection.createStatement();
String str="=T(\"D:/Orders.xls\").select(Amount>1000 && Amount<=3000 && like(Client,\"*s*\"))";
ResultSet result = statement.executeQuery(str);

Complex SPL code can be saved as a script file first and then invoked from Java like a stored procedure, which effectively reduces the coupling between computing code and the front-end application.

Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call scriptFileName(?, ?)}");
statement.setObject(1, "2020-01-01");
statement.setObject(2, "2020-01-31");
statement.execute();

SPL is interpreted: modified code runs directly without compilation, and hot swapping is supported, which reduces maintenance workload and improves system stability. Kotlin and Scala are compiled languages, so the application must be restarted at some point after compilation.


Interactive command line

Kotlin's interactive command line requires a separate download and is started with the kotlinc command. In theory the Kotlin command line can perform arbitrarily complex data processing, but since the code is generally long and hard to modify at a prompt, it is better suited to simple numeric calculations:

>>>Math.sqrt(5.0)
2.23606797749979

Scala's interactive command line is built in and is started with the command of the same name. In theory the Scala command line can perform data processing, but because the code tends to be long, it too is better suited to simple numeric calculations:

scala>100*3
res1: Int = 300

SPL's interactive command line is also built in, started with the "esprocx -r -c" command. SPL code is generally short, and simple data processing can be done directly on the command line:

(1): T("d:/Orders.txt").groups(SellerId;sum(Amount):amt).select(amt>2000)
(2):^C
D:\raqsoft64\esProc\bin>Log level:INFO
1       4263.900000000001
3       7624.599999999999
4       14128.599999999999
5       26942.4

Summary

Through the comparisons above we can see that, for the common data processing tasks of application development: Kotlin is not professional enough, so its development efficiency is low; Scala has a degree of professionalism, and its development efficiency is higher than Kotlin's but still lower than SPL's; SPL has more concise syntax, higher expressive efficiency, more data source types, easier-to-use interfaces, more professional structured data objects, richer functions, and stronger computing power, so its development efficiency is far higher than that of Kotlin and Scala.

