The Fake Route in the Linux Bridge

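When the kernel creates a new bridge, it assigns the bridge a fake routing table entry (fake_rtable). It is fake because it is not obtained by looking up the FIB routing table: it is built by hand and cannot be used to route packets. Most functions in its operations set are empty. Currently only fake_mtu does any work, returning the MTU of the associated network device; the others, such as fake_redirect, do nothing.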

static unsigned int fake_mtu(const struct dst_entry *dst)
{
    return dst->dev->mtu;
}
static struct dst_ops fake_dst_ops = {
    .update_pmtu    = fake_update_pmtu,
    .redirect   = fake_redirect,
    .cow_metrics    = fake_cow_metrics,
    .neigh_lookup   = fake_neigh_lookup,
    .mtu        = fake_mtu,
};
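
The pattern above, an operations table whose callbacks are mostly no-ops, can be sketched in user space. This is only an illustration: the sketch_* types and functions are hypothetical stand-ins for the kernel's net_device, dst_entry and dst_ops, and model just the two callbacks needed to show the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures;
 * only the fields used in this sketch are modelled. */
struct sketch_dev { unsigned int mtu; };
struct sketch_dst;

struct sketch_dst_ops {
    void (*update_pmtu)(struct sketch_dst *dst, unsigned int mtu);
    unsigned int (*mtu)(const struct sketch_dst *dst);
};

struct sketch_dst {
    struct sketch_dev *dev;
    const struct sketch_dst_ops *ops;
};

/* Deliberately empty: the fake route ignores path MTU updates. */
void sketch_fake_update_pmtu(struct sketch_dst *dst, unsigned int mtu)
{
    (void)dst;
    (void)mtu;
}

/* The only callback with real behaviour: report the device MTU. */
unsigned int sketch_fake_mtu(const struct sketch_dst *dst)
{
    assert(dst != NULL && dst->dev != NULL);
    return dst->dev->mtu;
}

const struct sketch_dst_ops sketch_fake_ops = {
    .update_pmtu = sketch_fake_update_pmtu,
    .mtu         = sketch_fake_mtu,
};
```

Because the mtu callback reads the device on every call, a caller going through the ops table always sees the current device MTU, and a call to update_pmtu has no effect on it.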

Fake route initialization

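The bridge's fake route is initialized by br_netfilter_rtable_init. As the code shows, the only metric it sets is the MTU (PMTU) of the route's dst, which becomes 1500. The comment in the source explains the fake route as follows: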

/*
 * Initialize bogus route table used to keep netfilter happy.
 * Currently, we fill in the PMTU entry because netfilter
 * refragmentation needs it, and the rt_flags entry because
 * ipt_REJECT needs it. Future netfilter modules might
 * require us to fill additional fields.
 */
static const u32 br_dst_default_metrics[RTAX_MAX] = {
    [RTAX_MTU - 1] = 1500,
};
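
The `[RTAX_MTU - 1]` index reflects how the metrics array is laid out: metric N lives in slot N - 1. A minimal user-space sketch of that layout follows. The SKETCH_* names are hypothetical; the MTU and HOPLIMIT numbers are believed to match the kernel's RTAX_* values, while SKETCH_RTAX_MAX is an arbitrary size chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's RTAX_MTU and RTAX_HOPLIMIT numbering;
 * SKETCH_RTAX_MAX is an arbitrary size for this example. */
enum {
    SKETCH_RTAX_MTU      = 2,
    SKETCH_RTAX_HOPLIMIT = 10,
    SKETCH_RTAX_MAX      = 16,
};

/* Metric N is stored in slot N - 1. Every slot not named in the
 * designated initializer is zero. */
const uint32_t sketch_default_metrics[SKETCH_RTAX_MAX] = {
    [SKETCH_RTAX_MTU - 1] = 1500,
};

/* Read one metric, applying the same N - 1 offset. */
uint32_t sketch_metric(const uint32_t *metrics, int metric)
{
    assert(metric >= 1 && metric <= SKETCH_RTAX_MAX);
    return metrics[metric - 1];
}
```

Reading SKETCH_RTAX_MTU from this table yields 1500, while any other metric, SKETCH_RTAX_HOPLIMIT included, reads as 0.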

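As the following code shows, the rt_flags member of the rtable is never explicitly initialized. The fake_rtable is embedded in struct net_bridge, whose memory is zeroed when the bridge device is allocated, so rt_flags is 0.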

void br_netfilter_rtable_init(struct net_bridge *br)
{
    struct rtable *rt = &br->fake_rtable;

    atomic_set(&rt->dst.__refcnt, 1);
    rt->dst.dev = br->dev;
    rt->dst.path = &rt->dst;
    dst_init_metrics(&rt->dst, br_dst_default_metrics, true);
    rt->dst.flags   = DST_NOXFRM | DST_FAKE_RTABLE;
    rt->dst.ops = &fake_dst_ops;
}

Using the fake route

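First, consider how rt_flags is used. The relevant code is in net/ipv4/netfilter/ipt_REJECT.c, which implements the iptables REJECT target. It follows that rt_flags only matters once the bridge has been configured to pass its traffic through iptables (the bridge-nf-call-iptables setting). For example, in bridge mode, a rule that rejects packets whose source IP address is in 192.168.1.0/24 looks like this: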

iptables -A INPUT -p ALL -s 192.168.1.0/24 -j REJECT --reject-with icmp-host-prohibited

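Packets that match this rule are dropped, and an icmp-host-prohibited message is sent back to the sender. netfilter sends that message with nf_send_unreach, which ends up calling icmp_send. icmp_send reads the rt_flags of the route attached to the packet, which for the fake route is 0, so the check below does not suppress the reply: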

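void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
{
    /* Route attached to the offending packet; the fake rtable on a bridge. */
    struct rtable *rt = skb_rtable(skb_in);
    /* ... */

    /* Never reply to packets that were routed as broadcast or multicast. */
    if (rt->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
        goto out;
    /* ... */
}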
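
The effect of a zero rt_flags on this check can be shown with a small user-space sketch. The function name is hypothetical, and the two flag values are assumed to match RTCF_BROADCAST and RTCF_MULTICAST in the kernel's in_route.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match RTCF_BROADCAST / RTCF_MULTICAST in in_route.h. */
#define SKETCH_RTCF_BROADCAST 0x10000000u
#define SKETCH_RTCF_MULTICAST 0x20000000u

/* Returns true when the flag check would let the ICMP reply go out,
 * i.e. when neither the broadcast nor the multicast bit is set. */
bool sketch_icmp_allowed(uint32_t rt_flags)
{
    return (rt_flags & (SKETCH_RTCF_BROADCAST | SKETCH_RTCF_MULTICAST)) == 0;
}
```

With rt_flags equal to 0, as on the fake route, the function returns true. Setting either bit makes it return false, which corresponds to the `goto out` path.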

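Next, consider how the PMTU set on the dst is used. The code is again in ipt_REJECT.c. With the following iptables rule in place, the system sends a TCP reset to the sender whenever a packet from 192.168.1.0/24 matches: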

sudo iptables -A INPUT -p tcp -s 192.168.1.0/24 -j REJECT --reject-with tcp-rst

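The kernel implements this in nf_send_reset, which reads the MTU of the dst to check that the length of the newly built nskb is valid. ip4_dst_hoplimit reads the RTAX_HOPLIMIT field of the dst metrics. That field is not initialized and is therefore zero, so ip4_dst_hoplimit falls back to the system default hop limit (sysctl_ip_default_ttl).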

void nf_send_reset(struct sk_buff *oldskb, int hook)
{
    skb_dst_set_noref(nskb, skb_dst(oldskb));

    niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP,
                   ip4_dst_hoplimit(skb_dst(nskb)));

    if (nskb->len > dst_mtu(skb_dst(nskb)))
        goto free_nskb;
}
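
The two decisions made here, the hop limit fallback and the length check, can be sketched in user space as follows. The function names are hypothetical, and SKETCH_DEFAULT_TTL stands in for sysctl_ip_default_ttl with an assumed value of 64.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for sysctl_ip_default_ttl; 64 is an assumed value. */
#define SKETCH_DEFAULT_TTL 64

/* A raw hop limit metric of 0 means "not set", so use the default. */
int sketch_hoplimit(int raw_hoplimit_metric)
{
    return raw_hoplimit_metric ? raw_hoplimit_metric : SKETCH_DEFAULT_TTL;
}

/* The reset is only sent when it fits within the MTU of the dst. */
bool sketch_reset_fits(unsigned int pkt_len, unsigned int mtu)
{
    return pkt_len <= mtu;
}
```

For the fake route the raw metric is 0, so the first function returns the default. A 40-byte reset against an MTU of 1500 passes the second check.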

In addition, when netfilter sends a packet through br_nf_ip_fragment, it uses the MTU of the dst to decide whether fragmentation can proceed.

static int br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, ...)
{
    unsigned int mtu = ip_skb_dst_mtu(skb);
    struct iphdr *iph = ip_hdr(skb);

    if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
             (IPCB(skb)->frag_max_size &&
              IPCB(skb)->frag_max_size > mtu))) {
        IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
        kfree_skb(skb);
        return -EMSGSIZE;
    }
    return ip_do_fragment(sk, skb, output);
}
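
The condition guarding ip_do_fragment can be isolated in a small user-space sketch. struct sketch_pkt and sketch_fragment_check are hypothetical; they model only the three inputs the condition looks at, not the real sk_buff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the packet state consulted by the check. */
struct sketch_pkt {
    bool df_set;                /* IP_DF bit set in the IP header */
    bool ignore_df;             /* DF may be ignored for this packet */
    unsigned int frag_max_size; /* largest original fragment, 0 if unset */
};

/* Returns 0 if fragmentation may proceed, -EMSGSIZE otherwise. */
int sketch_fragment_check(const struct sketch_pkt *p, unsigned int mtu)
{
    if ((p->df_set && !p->ignore_df) ||
        (p->frag_max_size && p->frag_max_size > mtu))
        return -EMSGSIZE;
    return 0;
}
```

A packet with DF set is refused unless ignore_df is also set, and a packet whose recorded frag_max_size exceeds the MTU is refused regardless of DF.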

Kernel version

linux-3.10.0