[Space-time series] OpenCastKit, a large global weather prediction model, is officially open source


OpenCastKit is High-Flyer AI's open-source large-model toolkit for AI weather forecasting. It includes two state-of-the-art weather models, FourCastNet and GraphCast, with fully open-sourced parameters, so users can easily generate global high-resolution weather forecasts.

OpenCastKit

Using AI methods to improve modern numerical weather prediction (NWP) has attracted widespread attention over the past two years, with models such as NVIDIA's FourCastNet, DeepMind's GraphCast, and Huawei's Pangu weather model. In comparisons against the high-resolution Integrated Forecast System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF), these models have achieved good results.

Based on this, we recently reproduced and integrated these works and contributed the results to the open-source community, building a new global AI weather forecasting project, OpenCastKit, based on the FourCastNet and GraphCast papers.


Project address: https://github.com/HFAiLab/OpenCastKit

This project provides a powerful open-source weather model and parameters trained on ERA5 data, which can generate global high-resolution weather forecasts. Specifically, it contains:

  • A unified data-processing tool that extracts ERA5 data and features and organizes them into frecord, a high-performance training data format;

  • FourCastNet and GraphCast model source code, optimized with hfai operators and hfreduce parallel communication, for community research and further optimization;

  • Model parameters trained on the Firefly high-performance cluster with 15 TB of ERA5 data from 1979 to 2022, which can be fine-tuned to obtain high-precision prediction results.

At the same time, we launched HF-Earth, a daily-updated demo that shows the global predictions produced by the large weather model:


Demo address: https://www.high-flyer.cn/hf-earth/

After a period of testing, the large AI weather model has proven effective at predicting typhoons, extreme precipitation, and other events, and it can also contribute to the analysis of long-term climate change. We hope that more powerful AI weather applications can be built on top of this open-source project.

Dataset

The European Centre for Medium-Range Weather Forecasts (ECMWF) provides ERA5, a publicly available comprehensive reanalysis dataset that combines physical model output with observations from around the world into a globally complete and consistent record. It offers many global meteorological variables, including temperature, wind, precipitation, hydrology, and air pressure, for training various weather forecasting models.

Official address: https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5

FourCastNet and GraphCast are trained on different subsets of ERA5, resulting in different forecasting behavior. The former uses only 20 meteorological variables, including temperature, wind speed, and relative humidity at 4 different pressure levels plus several near-surface variables, and is aimed at early warning of extreme weather and natural disasters; the latter uses more comprehensive data, with meteorological variables at 37 pressure levels plus 5 surface variables (227 input features in total), aiming at a more comprehensive assessment and forecast of meteorological change.
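For readers who want to pull the raw data themselves, ERA5 is available from the Copernicus Climate Data Store; the sketch below uses the standard cdsapi client (generic ECMWF tooling, not part of OpenCastKit, and the variable and date selection here is purely illustrative):

import cdsapi  # pip install cdsapi; requires a free CDS account and API key

c = cdsapi.Client()
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature",
                     "10m_u_component_of_wind",
                     "10m_v_component_of_wind"],
        "year": "2022",
        "month": "06",
        "day": "22",
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "format": "netcdf",
    },
    "era5_surface_2022-06-22.nc",
)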

We consolidated these data and manage and optimize them with the hfai.datasets tool. Through feature processing, the raw data are transformed into the pattern "X_{t-1}, X_t → X_{t+1}" and saved in the high-performance training sample format frecord, enabling efficient parallel training on the Firefly cluster. For more information, see the hfai dataset repository.
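As an illustration of that layout, here is a generic numpy sketch of turning a time-ordered field series into (X_{t-1}, X_t) → X_{t+1} samples; this is not the actual hfai.datasets/frecord code, just the idea:

import numpy as np

def make_autoregressive_samples(series):
    # series: time-ordered fields of shape (T, C, H, W)
    # returns ((X_{t-1}, X_t), X_{t+1}) pairs for t = 1 .. T-2
    inputs = [(series[t - 1], series[t]) for t in range(1, len(series) - 1)]
    targets = [series[t + 1] for t in range(1, len(series) - 1)]
    return inputs, targets

# e.g. 8 six-hourly steps with 4 channels on a small toy grid
fields = np.random.rand(8, 4, 32, 64).astype(np.float32)
x_pairs, y_next = make_autoregressive_samples(fields)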

Model building and optimization

For global weather forecasting at 0.25° resolution, FourCastNet employs an adaptive Fourier neural operator (AFNO), while GraphCast employs a graph neural network. The former is computationally efficient and can model dependencies across space and between different variables flexibly and scalably; the latter captures effects such as the "butterfly effect" in finer detail by building connections between nodes. The former can be trained with data parallelism at a small batch size to speed up training, while the latter has large message-passing parameter sets between sphere nodes and requires a pipeline-parallel (or model-parallel) transformation to train the full model.

[Figure: FourCastNet model structure]

[Figure: GraphCast model structure]

Here we use haiscale, our self-developed high-performance parallel-training library, to reproduce and optimize both models. For FourCastNet, we use haiscale.ddp or haiscale.fsdp for data-parallel optimization; in our experiments we reproduced the paper's results with a small batch size. For GraphCast, the full set of parameters cannot fit on a single GPU, so the different stages, such as the message passing between grid nodes and mesh nodes on the sphere, have to be split across different GPUs and chained together with haiscale.pipeline to train the model in parallel. The details are as follows:

FourCastNet data parallelism

The training of the FourCastNet model consists of three parts: pretrain, finetune, and precipitation. The model is autoregressive: it takes X_t as input and predicts the next step X_{t+1}; during training it rolls out multiple steps and the loss is computed against the ground truth, as shown in the following pseudocode:

import torch
from hfai.datasets import ERA5
from haiscale.ddp import DistributedDataParallel
from torch.utils.data.distributed import DistributedSampler


model = FourCastNet(args).cuda()
model = DistributedDataParallel(model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)


data = ERA5(split='train')
sampler = DistributedSampler(data, shuffle=True)
dataloader = data.loader(args.batch_size, sampler=sampler, num_workers=8, pin_memory=True, drop_last=True)


# training ...
for step, (xt0, xt1, xt2, pt2) in enumerate(dataloader):
    xt1_pred = model(xt0)                        # pretrain: one-step prediction
    xt2_pred = model(xt1_pred)                   # finetune: roll the prediction forward another step
    pt2_pred = model(xt2_pred, precip=True)      # precipitation head


    pretrain_loss = criterion(xt1_pred, xt1)
    finetune_loss = criterion(xt2_pred, xt2)
    precip_loss = criterion(pt2_pred, pt2)

    # optim ...


# stop hfreduce
model.reducer.stop()

By default, haiscale.ddp uses hfreduce for communication optimization. We can also use the hfai optimized operators by adding one line of code, model = hfai.nn.to_hfai(model), for a further speedup. On the Firefly cluster, we use 96 A100 GPUs for data-parallel acceleration, and the full FourCastNet training takes roughly 16 to 17 hours.
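As a minimal sketch of where that one-line change might sit in the data-parallel script above (the placement before the DDP wrapper is our assumption, not stated in the text):

import hfai.nn
from haiscale.ddp import DistributedDataParallel

model = FourCastNet(args).cuda()
model = hfai.nn.to_hfai(model)           # swap in hfai-optimized operators (the one-line change)
model = DistributedDataParallel(model)   # hfreduce-backed data parallelism, as before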

GraphCast pipeline parallelism

Unlike FourCastNet, GraphCast has only a single backbone model. It is also autoregressive, but takes X_{t-1}, X_t, T, C, and G as inputs to predict the next step X_{t+1}, where T and C represent timestamp and geographic-location information and G is the constructed sphere graph, as shown in the following pseudocode:

import os
import torch
import torch.distributed as dist
from hfai.datasets import ERA5
from haiscale.ddp import DistributedDataParallel
from haiscale.pipeline import PipeDream, make_subgroups, partition
from torch.utils.data.distributed import DistributedSampler


dist.init_process_group(...)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
rank, world_size = dist.get_rank(), dist.get_world_size()


dp_group, pp_group = make_subgroups(pp_size=args.pp_size)
dp_rank, dp_size = dp_group.rank(), dp_group.size()
pp_rank, pp_size = pp_group.rank(), pp_group.size()


model = GraphCast_sequential(args)
model = partition(model, pp_group.rank(), pp_group.size(), balance=[1, 1, 1, 1, 1, 1, 1, 1])
model = DistributedDataParallel(model.cuda(), process_group=dp_group)
model = PipeDream(model, args.chunks, process_group=pp_group)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)


data = ERA5(split='train')
sampler = DistributedSampler(data, num_replicas=dp_size, rank=dp_rank, shuffle=True)
dataloader = data.loader(args.batch_size, sampler=sampler, num_workers=8, pin_memory=True, drop_last=True)
earth_graph = generate_graph(args)


# training ...
for step, (xt0, xt1, xt2) in enumerate(dataloader):
    loss = model.forward_backward(xt0, xt1, earth_graph, criterion=criterion, labels=(xt2,))

    # optim ...


# synchronize all processes
model.module.reducer.stop()
dist.barrier()

When using haiscale.pipeline for pipeline-parallel training, we need to split the model in advance and chain the stages together via haiscale.SequentialModel. haiscale also provides a unified forward_backward interface for feeding in samples and labels. On the Firefly cluster, we use 256 A100 GPUs for parallel acceleration (8-GPU pipeline parallelism within each node, data parallelism across 32 nodes); the full GraphCast training takes about 3 days.

For more details about the code, you can visit the project address to read the source code.

Forecast results

Following the evaluation method in the papers, we recursively output forecasts for multiple days into the future, compare them with the ground truth, and compare the different large AI weather models through their error-growth curves, as shown below:

[Figure: error-growth curves of FourCastNet, FourCastNet+, and GraphCast over the forecast horizon]

As the figure shows, in the 14-day medium-range forecast test, the error of both GraphCast and FourCastNet grows gradually over time because of the recursive rollout. In terms of overall error, GraphCast, which takes timestamp and geographic-location information into account, has a smaller prediction error than FourCastNet. Inspired by this, we added time and geographic information to FourCastNet and retrained it (FourCastNet+), and found that its prediction error was almost identical to GraphCast's.
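For reference, error curves like the one above are typically computed as a latitude-weighted RMSE at each forecast lead time, following the evaluation protocol of the FourCastNet and GraphCast papers; below is a minimal numpy sketch (illustrative, not taken from the repository):

import numpy as np

def lat_weighted_rmse(pred, truth, lats):
    # pred, truth: (n_lat, n_lon) fields; lats: latitudes in degrees, shape (n_lat,)
    w = np.cos(np.deg2rad(lats))
    w = w / w.mean()                         # normalize so the weights average to 1
    se = (pred - truth) ** 2                 # squared error per grid point
    return float(np.sqrt((w[:, None] * se).mean()))

# error-growth curve: evaluate each lead time of a recursive rollout
# preds, truths: (n_steps, n_lat, n_lon); lats: (n_lat,)
# curve = [lat_weighted_rmse(p, t, lats) for p, t in zip(preds, truths)]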

Below we show OpenCastKit's prediction quality by continuously outputting forecasts for 14 days starting from June 22, 2022:

[Animation: FourCastNet temperature prediction]

[Animation: GraphCast temperature prediction]

[Animation: FourCastNet wind prediction]

[Animation: GraphCast wind prediction]

[Animation: ground-truth temperature]

[Animation: ground-truth wind]

Both FourCastNet and GraphCast predict the evolution of wind and temperature fairly accurately. GraphCast is closer to the real situation, with richer and more consistent fine-grained textures in the meteorological fields, including its prediction of the tracks of the two typhoons off China's southeast coast starting on June 30.

