Optimization of Mysql query on big datasets

Simon :

I am relatively new to MySQL and for days I have been trying to improve a query that runs on a big table (>70 million rows). The good news: I get the correct result. Unfortunately, it takes roughly 7 minutes to compute, which is (in my opinion) way too long. I searched for ways to improve queries and came up with a multiple-column index on the table, but now I am somehow stuck and don't know how to improve further.

The table consists of 5 columns, all of which are of interest. The columns state, cur, min and max are INT. The samples column stores names, each of which appears many times in the column. Every row stores the information of one state of one sample, and every sample has millions of different states. The table looks like this:

samples state   cur min max

Sample1 1       58  12  110
Sample1 0       8   12  110
Sample2 1       70  150 190
Sample4 2       10  1   20
Sample3 2       80  50  70
Sample6 2       3   1   10
Sample5 0       18  21  90
Sample5 1       22  21  90
...

Now I want to perform some statistics on the table: how many times does a state occur whose cur lies between min and max? I want to get these numbers for every sample. In addition, I also want to calculate the relative amount of the states 0, 1 and 2 for every sample.

The resulting table looks like this:

        total amount    state 0     state 1 state 2 state 0 %   state 1 %   state 2 %

Sample1 14504366        13199105    961629  343632  91.0009     6.6299      2.3692
Sample2 13873909        12628523    926846  318540  91.0235     6.6805      2.2960
Sample3 10919017        9231997     828767  858253  84.5497     7.5901      7.8602
Sample4 10148540        8604527     768220  775793  84.7859     7.5698      7.6444
Sample5 14130796        12382867    1078724 669205  87.6304     7.6339      4.7358
Sample6 11307051        9947652     871388  488011  87.9774     7.7066      4.3160

I got the result by using the following code:

# state, cur, min and max are INT
# samples is VARCHAR(40)

# build the index
ALTER TABLE data_table ADD INDEX `index_name` (state, cur, min, max, samples);

# query
SELECT t.samples, 
COUNT(t.state) AS "total amount", 
amount_0 AS "state 0", 
amount_1 AS "state 1", 
amount_2 AS "state 2", 
amount_0 / COUNT(t.state) * 100 AS "state 0 %", 
amount_1 / COUNT(t.state) * 100 AS "state 1 %", 
amount_2 / COUNT(t.state) * 100 AS "state 2 %"
FROM data_table t
JOIN
(
   SELECT samples, COUNT(state) as amount_0
   FROM data_table
   WHERE state = 0 AND cur > min + 15 AND cur < max - 20
   GROUP BY samples
) tmp0 ON tmp0.samples = t.samples
JOIN
(
   SELECT samples, COUNT(state) as amount_1
   FROM data_table
   WHERE state = 1 AND cur > min + 15 AND cur < max - 20
   GROUP BY samples
) tmp1 ON tmp1.samples = t.samples
JOIN
(
   SELECT samples, COUNT(state) as amount_2
   FROM data_table
   WHERE state = 2 AND cur > min + 15 AND cur < max - 20
   GROUP BY samples
) tmp2 ON tmp2.samples = t.samples
WHERE cur > min + 15 AND cur < max - 20
GROUP BY t.samples;

EXPLAIN returns following output:

id  select_type table       partitions  type        possible_keys   key             key_len ref             rows        filtered    Extra   
1   PRIMARY     t           [NULL]      index       index_name      index_name      143     [NULL]          73647812    11.11       Using where; Using index; Using temporary; Using filesort
1   PRIMARY     <derived4>  [NULL]      ref         <auto_key0>     <auto_key0>     123     db.t.Sample     10          100         [NULL]
1   PRIMARY     <derived2>  [NULL]      ref         <auto_key0>     <auto_key0>     123     db.t.Sample     10          100         [NULL]
1   PRIMARY     <derived3>  [NULL]      ref         <auto_key0>     <auto_key0>     123     db.t.Sample     10          100         [NULL]
4   DERIVED     data_table  [NULL]      ref         index_name      index_name      5       const           7547114     11.11       Using where; Using index; Using temporary; Using filesort
3   DERIVED     data_table  [NULL]      ref         index_name      index_name      5       const           15150796    11.11       Using where; Using index; Using temporary; Using filesort
2   DERIVED     data_table  [NULL]      ref         index_name      index_name      5       const           36823906    11.11       Using where; Using index; Using temporary; Using filesort

I think a big problem is the JOIN parts, where the indexes probably get lost. A second time-consuming step could be, if I understand correctly, the usage of WHERE, which joins everything first and then removes the rows that don't fit the condition in a second step. This is certainly time-consuming and could perhaps be avoided by using ON instead of WHERE(?). My problem is that I don't know how to implement this workaround, so I hope you can help me out.
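For example, I assume the condition could be moved into the ON clause like this (untested sketch, only the first JOIN shown):

) tmp0 ON tmp0.samples = t.samples
       AND t.cur > t.min + 15 AND t.cur < t.max - 20

but I am not sure whether this is actually different from WHERE for an INNER JOIN, or whether it helps the optimizer at all.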

scaisEdge :

You could avoid the extra subqueries by using conditional aggregation:

SELECT t.samples, 
COUNT(t.state) AS "total amount", 
amount_0 AS "state 0", 
amount_1 AS "state 1", 
amount_2 AS "state 2", 
amount_0 / COUNT(t.state) * 100 AS "state 0 %", 
amount_1 / COUNT(t.state) * 100 AS "state 1 %", 
amount_2 / COUNT(t.state) * 100 AS "state 2 %"
FROM data_table t
INNER JOIN (
   SELECT samples
    , COUNT(case when state = 0 then 1 else null end) as amount_0
    , COUNT(case when state = 1 then 1 else null end) as amount_1
    , COUNT(case when state = 2 then 1 else null end) as amount_2
   FROM data_table
   WHERE cur > min + 15 AND cur < max - 20
   GROUP BY samples
) tmp ON tmp.samples = t.samples
WHERE cur > min + 15 AND cur < max - 20
GROUP BY t.samples;
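In fact, since the derived table already groups by samples and the outer query applies the same WHERE condition, you could go one step further and compute everything in a single pass over the table, with no join at all. This sketch relies on MySQL treating a boolean comparison like state = 0 as 1 when true and 0 when false, so SUM(state = 0) counts the matching rows:

SELECT samples,
       COUNT(*) AS "total amount",
       SUM(state = 0) AS "state 0",
       SUM(state = 1) AS "state 1",
       SUM(state = 2) AS "state 2",
       SUM(state = 0) / COUNT(*) * 100 AS "state 0 %",
       SUM(state = 1) / COUNT(*) * 100 AS "state 1 %",
       SUM(state = 2) / COUNT(*) * 100 AS "state 2 %"
FROM data_table
WHERE cur > min + 15 AND cur < max - 20
GROUP BY samples;

With a single scan like this, an index that starts with samples, e.g. (samples, state, cur, min, max), may let the GROUP BY read the rows in index order and avoid the temporary table and filesort shown in your EXPLAIN output.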
