Powerful! Over a Hundred Excel Tables Processed in Under 3 Seconds with Python?

Case Background

In a parallel world, there is a giant company focused on outdoor sports. Since it is a giant, let's give it a friendlier name and call it Big Guy. Big Guy owns 20 brands, and these brands span 128 categories (industry segments), a staggeringly wide range that reaches practically everywhere.

Little Z is a data analyst at this giant in the parallel world. As soon as he arrived at the office today, he received a request: before the end of the workday, pick out the top 5 brands by total sales over the past year, along with their sales figures.

The past year? Top 5?

Wow, does such a simple request even need any computing? Just sort one column and it's done.

There's still a whole day, so no need to rush. First a cup of coffee and a look at the news.

In the blink of an eye it was 17:30. Little Z figured it was time to start on the day's request; even with a quick analysis afterwards, he should still make it out the door at 18:00 sharp.

When he opened the spreadsheet files his colleagues had shared, he had not yet realized that despair, which seemed so far away, was in fact so close.

His business colleagues had sent 128 tables in total. Each table holds the data for one industry segment: outdoor clothing, fishing equipment, life-saving equipment, you name it.

Each table records, by month (September 2018 to August 2019, roughly one year), the date, brand, number of visitors, average order value, conversion rate, category (industry segment), and other data.

Note: don't ask why the table data is laid out so oddly. This is a parallel world and it can be as willful as it likes; after all, it takes messy tables to show off how efficient Python is.

Little Z started planning. The end goal is the top 5 brands by sales over the past year. Subtotaling the data in a single table gives each brand's sales in that segment; to get total sales across all segments, he would have to do the subtotal 128 times and then merge the 128 results.

"This task looks daunting, but it is mainly a test of stamina." Little Z "saw through" to the essence of the matter at a glance. At the same moment, the words "The Red Army fears not the trials of the Long March" flashed through his mind in bright red. He put on his headphones, played the band Tang Dynasty's version of "The Internationale", and, with this double buff, set off on his expedition through the tables.

A true data-processing pro, Little Z's right index finger danced across the mouse as he charged ahead at a frantic 90 seconds per table. At that pace, ignoring any slowdown from fatigue, the job would take about 3.2 hours (128 tables × 90 seconds = 11,520 seconds).

By the 10th loop of "The Internationale", Little Z was somewhat discouraged; by the 20th, he began to despair.

Just as he was about to give up, he remembered the Python guru Master Pan (Pandas). He had only started learning it recently and was far from fluent, but at this last moment it was a faint light in the darkness and his only hope. Little Z decided to try solving the problem with Pandas.

He understood that the key to solving a batch problem with Python is to work out the solution for a single case first, and then loop over the batch.

Processing a single table

First, import the module and open a single table.
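
The original code for this step is not preserved here, so the following is only a minimal sketch of what it might look like. The helper `load_table` and the English column names are assumptions made for illustration; the real workbooks may use different headers. A small in-memory table stands in for one workbook so the snippet runs without any files.

```python
import pandas as pd


def load_table(path):
    """Read one category workbook (first sheet) into a DataFrame.

    Hypothetical helper: pd.read_excel needs an Excel engine such as
    openpyxl to be installed when it is actually called.
    """
    return pd.read_excel(path)


# Stand-in for a single workbook; column names are assumed, not the originals.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2018-09-01", "2018-09-01", "2019-08-01"]),
    "brand": ["Brand-A", "Brand-B", "Brand-A"],
    "visitors": [1000, 2000, 1500],
    "conversion_rate": [0.02, 0.01, 0.04],
    "avg_order_value": [300.0, 250.0, 200.0],
    "category": ["fishing equipment"] * 3,
})

print(sample.head())
```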

Next, summarize each brand's sales in this industry segment. What we want is sales by brand for the past year (September 2018 to August 2019), so first check that the dates are correct.
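
One possible way to run that date check is sketched below. The column name `date` and the sample rows are assumptions; the idea is simply to look at the earliest and latest dates and confirm every row falls inside the September 2018 to August 2019 window.

```python
import pandas as pd

# Assumed column names and made-up rows standing in for one table.
df = pd.DataFrame({
    "date": ["2018-09-01", "2019-03-01", "2019-08-01"],
    "brand": ["Brand-A", "Brand-B", "Brand-A"],
})

# Parse text dates so they can be compared and grouped by month.
df["date"] = pd.to_datetime(df["date"])

first, last = df["date"].min(), df["date"].max()

# The window named in the article: September 2018 through August 2019.
window_start = pd.Timestamp("2018-09-01")
window_end = pd.Timestamp("2019-08-31")
all_in_window = bool(df["date"].between(window_start, window_end).all())

# Number of distinct months present in this table.
n_months = df["date"].dt.to_period("M").nunique()

print(first.date(), last.date(), all_in_window, n_months)
```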

Just as he was about to sum up sales, Little Z noticed there was no sales field. Sales have to be calculated as the product of three fields: visitors × conversion rate × average order value.
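
A sketch of that calculation, assuming the three fields are stored in columns named `visitors`, `conversion_rate`, and `avg_order_value` (hypothetical names) and that the conversion rate is a fraction rather than a percentage:

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "brand": ["Brand-A", "Brand-B", "Brand-A"],
    "visitors": [1000, 2000, 1500],
    "conversion_rate": [0.02, 0.01, 0.04],
    "avg_order_value": [300.0, 250.0, 200.0],
})

# sales = visitors x conversion rate x average order value, row by row.
df["sales"] = df["visitors"] * df["conversion_rate"] * df["avg_order_value"]

print(df[["brand", "sales"]])
```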

Group by brand and sum the sales to get each brand's total sales for the past year.
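
The per-brand total can be sketched with `groupby`. As before, the column names and rows are assumptions made for illustration.

```python
import pandas as pd

# Made-up rows for one segment; column names are assumed.
df = pd.DataFrame({
    "brand": ["Brand-A", "Brand-B", "Brand-A"],
    "visitors": [1000, 2000, 1500],
    "conversion_rate": [0.02, 0.01, 0.04],
    "avg_order_value": [300.0, 250.0, 200.0],
})
df["sales"] = df["visitors"] * df["conversion_rate"] * df["avg_order_value"]

# One row per brand: the sum of that brand's monthly sales in this segment.
brand_sales = df.groupby("brand")["sales"].sum()

print(brand_sales)
```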

One detail: since Little Z will eventually combine the sales of all segments, each segment's result should carry a distinguishing label so that results do not overwrite one another. The file name used to open the table is a natural choice for that label, but remember to strip the file extension.

OK, processing for a single table is done. Now we just need to extend the same series of operations to every table.
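
One way to attach such a label is sketched here. The file name is hypothetical, and naming the result after the file is just one possible labeling scheme, not necessarily the one used in the original code.

```python
import os

import pandas as pd

file_name = "fishing equipment.xlsx"  # hypothetical file name

# splitext separates the base name from the extension: ("fishing equipment", ".xlsx").
label = os.path.splitext(file_name)[0]

# Made-up per-brand totals for this one segment.
brand_sales = pd.Series(
    [18000.0, 5000.0],
    index=pd.Index(["Brand-A", "Brand-B"], name="brand"),
    name="sales",
)

# Name the result after its segment so that results from different
# files stay distinguishable when they are combined later.
labeled = brand_sales.rename(label)

print(labeled)
```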

Batch processing in a loop

Little Z used os.listdir to walk through the file names and read and process the files in a loop. He also brought in the time module to see just how much faster Python is than manual work when faced with 128 tables.

Wow, the whole run finished in one go, in under 3 seconds, an average of about 0.02 seconds per table! Sweet!

To make sure the data looks right, take a preview.
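
The loop might look roughly like the sketch below. The function names `summarize_folder` and `timed_top5`, the `reader` and `suffix` parameters, and the column names are all assumptions for illustration; the original code is not reproduced here. Each file's per-brand totals become one labeled column, and the row sums across columns give each brand's total over all segments.

```python
import os
import time

import pandas as pd


def summarize_folder(folder, reader=pd.read_excel, suffix=".xlsx"):
    """Return the top 5 brands by total sales across all tables in `folder`."""
    results = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(suffix):
            continue  # skip anything that is not a data table
        df = reader(os.path.join(folder, name))
        # Assumed column names; sales is derived, not stored.
        df["sales"] = df["visitors"] * df["conversion_rate"] * df["avg_order_value"]
        label = os.path.splitext(name)[0]  # file name without extension
        results.append(df.groupby("brand")["sales"].sum().rename(label))
    # One column per segment, brands as the index; a brand missing from a
    # segment gets NaN there, which sum() skips by default.
    wide = pd.concat(results, axis=1)
    total = wide.sum(axis=1)
    return total.sort_values(ascending=False).head(5)


def timed_top5(folder, **kwargs):
    """Run summarize_folder and also report the elapsed wall-clock seconds."""
    start = time.time()
    top5 = summarize_folder(folder, **kwargs)
    return top5, time.time() - start
```

A call such as `top5, elapsed = timed_top5("data")` would then process every `.xlsx` file in a folder named `data` (a hypothetical path). The `reader` parameter exists only so the same loop can be tried on CSV files with `pd.read_csv`.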

The sales values look like strange strings. In fact pandas has taken it upon itself to display them in scientific notation; to show the plain values, the display setting needs to be changed.
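
A sketch of that change using the pandas `display.float_format` option. The brand names and values below are made up for illustration. Note that `pd.set_option` changes the display for the whole session, not just for one table.

```python
import pandas as pd

# Made-up totals large enough that pandas may show them in scientific notation.
top5 = pd.Series([1.5e9, 9.8e8], index=["Brand-A", "Brand-B"], name="sales")

# Show floats as plain numbers with two decimal places.
pd.set_option("display.float_format", "{:.2f}".format)

text = top5.to_string()
print(text)
```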

OK, one way or another, we now have the result we wanted: the top 5 brands by sales over the past year and their sales figures. The results show Big Guy's 20 brands flourishing across the board. Among the top 5, the leading brand's annual sales reach 1.226 billion yuan, the fifth-ranked brand still reaches 979 million yuan, and the average per brand is 1.085 billion yuan.

Note: the complete source data (all 128 tables) and code for this case are available by joining group 784758214.

Summary

This article starts from a scenario that is both simple and complex: the request itself is simple, but the underlying tables are numerous and messy. The code and logic are easy to follow. The main aim is to toss out a brick to attract jade: to break down the mental barrier around batch-processing tables and encourage readers to use Python to simplify their own work in suitable scenarios. It is also much more fun to explore and analyze the 128-table case for yourself.


Origin blog.csdn.net/NNNJ9355/article/details/103864943