[Data Mining] How to ensure data consistency?

 

1. Description

         I used to work as a data analyst for a web analytics service company. Such systems help websites collect and analyze customer behavior data. It goes without saying that data is the most valuable asset of a web analytics service. One of my main tasks was to monitor data quality.

To make sure everything is fine with the data, we need to focus on two things:

  • No missing or duplicated events -> the event and session counts stay within the expected range.
  • The data is correct -> the distribution of values for each parameter stays the same, and a new release hasn't suddenly started logging all browsers as Safari or stopped tracking purchases entirely.

        Today, I want to tell you about my experience with this complex task. As a bonus, I'll show examples of ClickHouse array functions.


2. What is web analytics?

        Web analytics systems record a lot of information about events on a website: which browsers and operating systems customers use, which URLs they visit, how much time they spend on the website, and even which products they add to their shopping carts and buy. All of this data can be used for reporting (to understand how many customers visited the site) or analytics (to understand pain points and improve the customer experience). You can find more details about web analytics on Wikipedia.

        We will use ClickHouse's anonymized web analytics dataset. A guide describing how to load it can be found here.

        Let's look at the data. VisitID is the unique identifier for a session, while the other parameters are characteristics of that session. UserAgent and OS look like numeric variables, but they are encoded browser and operating system names. It is much more efficient to store these values as numbers and decode them at the application level. This optimization matters and can save you terabytes if you're dealing with big data.

SELECT
    VisitID,
    StartDate,
    UTCStartTime,
    Duration,
    PageViews,
    StartURLDomain,
    IsMobile,
    UserAgent,
    OS
FROM datasets.visits_v1
FINAL
LIMIT 10

┌─────────────VisitID─┬──StartDate─┬────────UTCStartTime─┬─Duration─┬─PageViews─┬─StartURLDomain─────────┬─IsMobile─┬─UserAgent─┬──OS─┐
│ 6949594573706600954 │ 2014-03-17 │ 2014-03-17 11:38:42 │        0 │         1 │ gruzomoy.sumtel.com.ua │        0 │         7 │   2 │
│ 7763399689682887827 │ 2014-03-17 │ 2014-03-17 18:22:20 │       24 │         3 │ gruzomoy.sumtel.com.ua │        0 │         2 │   2 │
│ 9153706821504089082 │ 2014-03-17 │ 2014-03-17 09:41:09 │      415 │         9 │ gruzomoy.sumtel.com.ua │        0 │         7 │  35 │
│ 5747643029332244007 │ 2014-03-17 │ 2014-03-17 04:46:08 │       19 │         1 │ gruzomoy.sumtel.com.ua │        0 │         2 │ 238 │
│ 5868920473837897470 │ 2014-03-17 │ 2014-03-17 10:10:31 │       11 │         1 │ gruzomoy.sumtel.com.ua │        0 │         3 │  35 │
│ 6587050697748196290 │ 2014-03-17 │ 2014-03-17 09:06:47 │       18 │         2 │ gruzomoy.sumtel.com.ua │        0 │       120 │  35 │
│ 8872348705743297525 │ 2014-03-17 │ 2014-03-17 06:40:43 │      190 │         6 │ gruzomoy.sumtel.com.ua │        0 │         5 │ 238 │
│ 8890846394730359529 │ 2014-03-17 │ 2014-03-17 02:27:19 │        0 │         1 │ gruzomoy.sumtel.com.ua │        0 │        57 │  35 │
│ 7429587367586011403 │ 2014-03-17 │ 2014-03-17 01:13:14 │        0 │         1 │ gruzomoy.sumtel.com.ua │        1 │         1 │  12 │
│ 5195928066127503662 │ 2014-03-17 │ 2014-03-17 01:43:02 │     1926 │         3 │ gruzomoy.sumtel.com.ua │        0 │         2 │  35 │
└─────────────────────┴────────────┴─────────────────────┴──────────┴───────────┴────────────────────────┴──────────┴───────────┴─────┘

        You may notice that I specified the FINAL modifier after the table name. I did this to make sure the data is fully merged and I get only one row per session.

        The CollapsingMergeTree engine is often used in ClickHouse, since it allows using inserts instead of updates (more details in the docs). With this approach you can have several rows per session after updates; they are merged in the background by the system, and the FINAL modifier forces this process at query time.
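
        To illustrate the idea (a conceptual Python sketch of the collapsing logic, not ClickHouse internals): an update is written as two new rows, a copy of the outdated row with Sign = -1 and the new state with Sign = +1, and during merges matching rows with opposite signs cancel out.

from collections import Counter

# each row: (VisitID, PageViews, Sign); values are illustrative
rows = [
    (42, 1, 1),   # session first logged with 1 page view
    (42, 1, -1),  # "update": cancel the outdated state...
    (42, 3, 1),   # ...and insert the new state
]

# collapsing: rows identical except for Sign cancel each other out
net = Counter()
for visit_id, page_views, sign in rows:
    net[(visit_id, page_views)] += sign

print([row for row, sign in net.items() if sign > 0])  # [(42, 3)]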

        We can perform two simple queries to see the difference.

SELECT
    uniqExact(VisitID) AS unique_sessions,
    sum(Sign) AS number_sessions, 
    -- number of sessions after collapsing
    count() AS rows
FROM datasets.visits_v1

┌─unique_sessions─┬─number_sessions─┬────rows─┐
│         1676685 │         1676581 │ 1680609 │
└─────────────────┴─────────────────┴─────────┘

SELECT
    uniqExact(VisitID) AS unique_sessions,
    sum(Sign) AS number_sessions,
    count() AS rows
FROM datasets.visits_v1
FINAL

┌─unique_sessions─┬─number_sessions─┬────rows─┐
│         1676685 │         1676721 │ 1676721 │
└─────────────────┴─────────────────┴─────────┘

        Using FINAL has its drawbacks in performance. You can find more information about it in the documentation.

3. How to ensure data quality?

        Verifying that there are no missing or duplicated events is straightforward. You can find many ways to detect anomalies in time series data, from naive methods (e.g. the number of events is within +20% or -20% compared to the previous week) to ML with libraries like Prophet or PyCaret.
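
        For illustration, here is a minimal sketch of the naive rule mentioned above (the counts and the 20% band are assumptions for the example):

# naive check: alert if this week's event count deviates more than 20%
# from the previous week's count (numbers are made up for illustration)
def is_count_anomalous(curr_count: int, prev_count: int, tolerance: float = 0.2) -> bool:
    lower, upper = prev_count * (1 - tolerance), prev_count * (1 + tolerance)
    return not (lower <= curr_count <= upper)

print(is_count_anomalous(curr_count=1_250_000, prev_count=1_680_609))  # True - worth investigating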

        Data consistency is a tricky task. As I mentioned before, web analytics services track a lot of information about your customers' behavior on your website. They document hundreds of parameters, and we need to make sure all of these values look valid.

        Parameters can be numeric (such as duration or the number of pages seen) or categorical (such as browser or operating system). For numeric values, we can use statistical criteria to ensure that the distribution remains the same - for example, the Kolmogorov-Smirnov test.
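
        A two-sample Kolmogorov-Smirnov check on session durations might look like this (a sketch assuming SciPy is available; the samples are synthetic and the 0.05 level is a common default, not a recommendation):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
prev_durations = rng.exponential(scale=120, size=10_000)  # synthetic stand-in for last week's Duration values
curr_durations = rng.exponential(scale=120, size=10_000)  # ...and for this week's

statistic, p_value = stats.ks_2samp(prev_durations, curr_durations)
if p_value < 0.05:
    print(f'distribution changed (KS statistic = {statistic:.4f})')
else:
    print('no significant change detected')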

        So, after looking at best practices, my only remaining question was how to monitor the consistency of categorical variables. Time to discuss it.

4. Categorical variables

        Let's take the browser as an example. We have 62 unique browser values in our data.

SELECT uniqExact(UserAgent) AS unique_browsers
FROM datasets.visits_v1

┌─unique_browsers─┐
│              62 │
└─────────────────┘

SELECT
    UserAgent,
    count() AS sessions,
    round((100. * sessions) / (
        SELECT count()
        FROM datasets.visits_v1
        FINAL
    ), 2) AS sessions_share
FROM datasets.visits_v1
FINAL
GROUP BY 1
HAVING sessions_share >= 1
ORDER BY sessions_share DESC

┌─UserAgent─┬─sessions─┬─sessions_share─┐
│         7 │   493225 │          29.42 │
│         2 │   236929 │          14.13 │
│         3 │   235439 │          14.04 │
│         4 │   196628 │          11.73 │
│       120 │   154012 │           9.19 │
│        50 │    86381 │           5.15 │
│        79 │    63082 │           3.76 │
│       121 │    50245 │              3 │
│         1 │    48688 │            2.9 │
│        42 │    21040 │           1.25 │
│         5 │    20399 │           1.22 │
│        71 │    19893 │           1.19 │
└───────────┴──────────┴────────────────┘

        We could monitor each browser's share individually as a numeric variable, but in that case we would be monitoring at least 12 time series just for the UserAgent field. As everyone who has set up alerting at least once knows, the fewer variables we monitor, the better: tracking many parameters means a lot of false positive notifications to deal with.

        So I started thinking of a metric that would show the difference between distributions. The idea is to compare the browser shares now (T2) and before (T1). We can choose the previous period according to the granularity:

  •         For minute data, you can look at the previous point,
  •         For daily data, it's worth looking at the same day a week earlier to account for weekly seasonality,
  •         For monthly data, you can look at data from one year ago, as in the sketch below.
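
        A sketch of picking the comparison timestamp for each granularity (the granularity names and offsets are my own illustration of the rules above):

from datetime import datetime, timedelta

def previous_period(ts: datetime, granularity: str) -> datetime:
    # choose the comparison point T1 for the current point T2
    offsets = {
        'minute': timedelta(minutes=1),   # the previous point
        'day': timedelta(days=7),         # same weekday, accounts for weekly seasonality
        'month': timedelta(days=365),     # roughly one year ago
    }
    return ts - offsets[granularity]

print(previous_period(datetime(2014, 3, 18, 13, 0), 'day'))  # 2014-03-11 13:00:00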

        Let's look at an example: suppose the browser shares were 70%, 20% and 10% in the previous period (T1) and are 60%, 27% and 13% in the current one (T2).

        My first thought was to look at a heuristic metric similar to the L1 norm used in machine learning (more details): distance = (1/2) * Σ|curr_share - prev_share|.

        For the example above, this formula gives (0.1 + 0.07 + 0.03) / 2 = 10%. This metric makes intuitive sense: it shows the minimal share of events for which the browser value has changed.
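
        Here is the same calculation as a quick Python check (the helper name is mine; the arrays repeat the example above):

import numpy as np

def get_l1_norm(prev, curr):
    # half the sum of absolute share differences
    return 0.5 * np.sum(np.abs(prev - curr))

prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])
print(get_l1_norm(prev, curr))  # ~0.1 -> 10% of events changed the browser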

        Afterwards, I discussed this topic with my boss, who has a lot of experience in data science. He suggested that I look at the Kullback-Leibler and Jensen-Shannon divergences, since these are well-established ways to measure the distance between probability distributions.

        If you don't remember these measures or have never heard of them before, don't worry: I was in your shoes. So I googled the formulas (this article explains the concepts thoroughly) and calculated the values for our example.

import numpy as np
  
prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])

def get_kl_divergence(prev, curr):
    # Kullback-Leibler divergence: sum of p * log(p / q)
    kl = prev * np.log(prev / curr)
    return np.sum(kl)

def get_js_divergence(prev, curr):
    # Jensen-Shannon divergence: the average KL divergence to the midpoint distribution
    mean = (prev + curr) / 2
    return 0.5 * (get_kl_divergence(prev, mean) + get_kl_divergence(curr, mean))

kl = get_kl_divergence(prev, curr)
js = get_js_divergence(prev, curr)
print('KL divergence = %.4f, JS divergence = %.4f' % (kl, js))

# KL divergence = 0.0216, JS divergence = 0.0055
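
        If SciPy is available, you can cross-check the Jensen-Shannon value; note that scipy.spatial.distance.jensenshannon returns the distance, i.e. the square root of the divergence:

import numpy as np
from scipy.spatial.distance import jensenshannon

prev = np.array([0.7, 0.2, 0.1])
curr = np.array([0.6, 0.27, 0.13])
print(jensenshannon(prev, curr) ** 2)  # ~0.0055, matches the value above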

        As you can see, the three distances we calculated differ significantly. So, now that we have (at least) three ways to measure the difference between previous and current browser shares, the next question is which one to choose for our monitoring task.

5. The winner is...

        The best way to estimate the performance of different methods is to see how they perform in real life. To do this, we can simulate anomalies in the data and compare the effects.

        There are two common anomalies in the data:

  • Data loss: we start losing data from one of the browsers, and all the other browsers' shares increase proportionally.
  • Renaming: traffic from one browser starts being flagged as another one. For example, 10% of Safari events suddenly show up as 'undefined'.

        We can take the actual browser shares and simulate these anomalies. For simplicity, I'm going to merge all browsers with a share below 5% into the group browser = 0.

WITH browsers AS
    (
        SELECT
            UserAgent,
            count() AS raw_sessions,
            (100. * count()) / (
                SELECT count()
                FROM datasets.visits_v1
                FINAL
            ) AS raw_sessions_share
        FROM datasets.visits_v1
        FINAL
        GROUP BY 1
    )
SELECT
    if(raw_sessions_share >= 5, UserAgent, 0) AS browser,
    sum(raw_sessions) AS sessions,
    round(sum(raw_sessions_share), 2) AS sessions_share
FROM browsers
GROUP BY browser
ORDER BY sessions DESC

┌─browser─┬─sessions─┬─sessions_share─┐
│       7 │   493225 │          29.42 │
│       0 │   274107 │          16.35 │
│       2 │   236929 │          14.13 │
│       3 │   235439 │          14.04 │
│       4 │   196628 │          11.73 │
│     120 │   154012 │           9.19 │
│      50 │    86381 │           5.15 │
└─────────┴──────────┴────────────────┘

        Time to simulate both situations. You can find all the code on GitHub. For us, the most important parameter is the actual effect: the lost or changed share of events. Ideally, we'd like our metric to be equal to this effect.
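
        To give an idea of the setup, here is a minimal sketch of the data loss simulation (my reconstruction, not the code from the repository; the shares are taken from the table above):

import numpy as np

def get_l1_norm(prev, curr):
    return 0.5 * np.sum(np.abs(prev - curr))

# browser shares from the table above: browsers 7, 0, 2, 3, 4, 120, 50
shares = np.array([29.42, 16.35, 14.13, 14.04, 11.73, 9.19, 5.15]) / 100
shares /= shares.sum()  # renormalize away rounding leftovers

for lost_fraction in (0.1, 0.3, 0.5):
    curr = shares.copy()
    curr[0] *= 1 - lost_fraction               # browser 7 loses part of its events
    actual_effect = shares[0] * lost_fraction  # share of all events lost
    curr /= curr.sum()                         # remaining browsers gain share proportionally
    print(f'effect = {actual_effect:.3f}, L1 norm = {get_l1_norm(shares, curr):.3f}')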

        As a result of the simulation, we obtained two graphs showing the correlation between the actual effect and the distance metric.

        Each point in the graph shows the result of a simulation—the actual effect and the corresponding distance.

        You can easily see that the L1 norm is the best metric for our task, since it is closest to the line distance = share of affected events. The Kullback-Leibler and Jensen-Shannon divergences vary widely, and their levels differ depending on the case (which browser loses traffic).

        Such metrics are not suitable for monitoring, because you would not be able to specify a threshold that alerts you when more than 5% of your traffic is affected. Furthermore, we cannot easily interpret these metrics, whereas the L1 norm shows exactly the degree of the anomaly.

6. L1 norm calculation

        Now that we know what metric will show us the consistency of the data, the last remaining task is to implement the L1 norm calculation in the database (in our case — ClickHouse).

        We can use the well-known window functions for it.

with browsers as (
    select
        UserAgent as param,
        multiIf(
            toStartOfHour(UTCStartTime) = '2014-03-18 12:00:00', 'previous',
            toStartOfHour(UTCStartTime) = '2014-03-18 13:00:00', 'current',
            'other'
        ) as event_time,
        sum(Sign) as events
    from datasets.visits_v1
    where (StartDate = '2014-03-18')
    -- filter by partition key is a good practice
        and (event_time != 'other')
    group by param, event_time)
select
    sum(abs_diff)/2 as l1_norm
from
    (select
        param,
        sumIf(share, event_time = 'current') as curr_share,
        sumIf(share, event_time = 'previous') as prev_share,
        abs(curr_share - prev_share) as abs_diff
    from
        (select
            param,
            event_time,
            events,
            sum(events) over (partition by event_time) as total_events,
            events/total_events as share
        from browsers)
    group by param)

┌─────────────l1_norm─┐
│ 0.01515028932687386 │
└─────────────────────┘

        ClickHouse has very powerful array functions, and I used them for a long time before window functions were supported. So I want to show you the power of this tool.

with browsers as (
    select
        UserAgent as param,
        multiIf(
            toStartOfHour(UTCStartTime) = '2014-03-18 12:00:00', 'previous',
            toStartOfHour(UTCStartTime) = '2014-03-18 13:00:00', 'current',
            'other'
        ) as event_time,
        sum(Sign) as events
    from datasets.visits_v1
    where StartDate = '2014-03-18' -- filter by partition key is a good practice
        and event_time != 'other'
    group by param, event_time
    order by event_time, param)
select l1_norm 
from
    (select
        -- aggregating all param values into arrays
        groupArrayIf(param, event_time = 'current') as curr_params,
        groupArrayIf(param, event_time = 'previous') as prev_params,
        
        -- calculating params that are present in both time periods or only in one of them
        arrayIntersect(curr_params, prev_params) as both_params,
        arrayFilter(x -> not has(prev_params, x), curr_params) as only_curr_params,
        arrayFilter(x -> not has(curr_params, x), prev_params) as only_prev_params,
        
        -- aggregating all events into arrays
        groupArrayIf(events, event_time = 'current') as curr_events,
        groupArrayIf(events, event_time = 'previous') as prev_events,
        
        -- calculating events shares
        arrayMap(x -> x / arraySum(curr_events), curr_events) as curr_events_shares,
        arrayMap(x -> x / arraySum(prev_events), prev_events) as prev_events_shares,
        
        -- filtering shares for browsers that are present in both periods
        arrayFilter((x, y) -> has(both_params, y), curr_events_shares, curr_params) as both_curr_events_shares,
        arrayFilter((x, y) -> has(both_params, y), prev_events_shares, prev_params) as both_prev_events_shares,
        
        -- filtering shares for browsers that are present only in one of the periods
        arrayFilter((x, y) -> has(only_curr_params, y), curr_events_shares, curr_params) as only_curr_events_shares,
        arrayFilter((x, y) -> has(only_prev_params, y), prev_events_shares, prev_params) as only_prev_events_shares,
        
        -- calculating the abs differences and the l1 norm
        arraySum(arrayMap((x, y) -> abs(x - y), both_curr_events_shares, both_prev_events_shares)) as both_abs_diff,
        1/2*(both_abs_diff + arraySum(only_curr_events_shares) + arraySum(only_prev_events_shares)) as l1_norm
    from browsers)

┌─────────────l1_norm─┐
│ 0.01515028932687386 │
└─────────────────────┘

        This approach might be handy for people with a pythonic mind. With persistence and creativity, any logic can be written using array functions.
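
        For comparison, roughly the same logic in plain Python (a sketch assuming the per-browser event counts for the two periods are already fetched into dicts):

def get_l1_norm(prev_events: dict, curr_events: dict) -> float:
    prev_total = sum(prev_events.values())
    curr_total = sum(curr_events.values())
    # the union of keys covers params present in only one of the periods
    params = set(prev_events) | set(curr_events)
    return 0.5 * sum(
        abs(curr_events.get(p, 0) / curr_total - prev_events.get(p, 0) / prev_total)
        for p in params
    )

print(get_l1_norm({'7': 700, '2': 200, '3': 100}, {'7': 600, '2': 270, '3': 130}))  # ~0.1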

7. Alerting and monitoring

        We now have two queries that show us fluctuations in the browser shares in our data, and you can use the same approach to monitor any other parameter of interest.

        The only thing left is to align with the team on the alert threshold. I usually look at historical data and previous anomalies to get initial levels, and then keep tweaking them as new information arrives: false positive alerts or missed anomalies.
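
        The check itself then becomes trivial (the 5% threshold here is just an illustration, echoing the example from the previous section):

L1_THRESHOLD = 0.05  # alert when more than ~5% of traffic looks affected

def check_consistency(param: str, l1_norm_value: float) -> None:
    if l1_norm_value > L1_THRESHOLD:
        # in practice, send this to your alerting channel of choice
        print(f"ALERT: '{param}' distribution shifted by {l1_norm_value:.1%}")

check_consistency('UserAgent', 0.0152)  # our example hour stays quiet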

        Also, while implementing monitoring, I came across some nuances that I would like to briefly cover:

  • There are parameters in the data that don't make sense to monitor, for example UserID or StartDate, so choose wisely which parameters to include.
  • You may have parameters with high cardinality. For example, StartURL in our web analytics data has over 600K unique values. Computing metrics for it can consume a lot of resources. Therefore, I would suggest either transforming these values (e.g. taking the domain or TLD) or monitoring only the top values and grouping the rest into a separate "Other" group.
  • You can use the same framework for numeric values by splitting them into buckets, as in the sketch after this list.
  • In some cases, data is expected to change significantly. For example, if you monitor an application version field, you will get an alert after each release. Such events at least help you verify that your monitoring is still working :)
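
        A sketch of the bucketing mentioned above, using the Duration field (the bucket edges are my own, purely illustrative):

def duration_bucket(duration_seconds: int) -> str:
    # map a numeric duration onto a categorical bucket
    edges = [0, 10, 60, 300, 1800]
    labels = ['0', '1-10s', '11-60s', '1-5min', '5-30min']
    for edge, label in zip(edges, labels):
        if duration_seconds <= edge:
            return label
    return '30min+'

# durations taken from the sample sessions earlier in the post
print([duration_bucket(d) for d in (0, 24, 415, 1926)])  # ['0', '11-60s', '5-30min', '30min+']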

 
