Matlab 提取网页信息保存到Excel（正则表达式）

这次扯个题外话，做这个完全是为帮同学获取快递网点信息

要求：得到上海市天天快递所有网点的信息 title_colum= {'所在地区','公司名称','地址','联系电话','派送范围','Link'};最后要将结果放到Excel中,具体如下

下图是某一个网点的信息

一共有78个网点，链接地址http://www.ttkdex.com/webdistributed/5.html

例如：

方案：十八般武艺，各显神通，

（1）手工复制：每个网页都得打开一次，复制四五次，还好不是特别多，但是韵达快递就多如牛毛了....

（2）HTML+JS+Active+...，哈哈，这个不太熟，还是用习惯的东西来做吧，后来知道，其实可以直接得到table数据的

（3）Matlab + 正则表达式，暂且用这个方案，Python应该更方便

基本原理：网页加载进来之后，就是一个很长很长的字符串，让后用正则表达式匹配你要的信息就好了

下面是程序实现：

url_main ='http://www.ttkdex.com/webdistributed/5.html';
% <b class="corinfo"><a href="webinfo/021953.html" >上海宝山吴淞公司</a></b>
ex_main = '<b class="corinfo"><a href="(.*?)" >(.*?)</a></b>';
% <td class="td_l">所在地区</td>
%         <td class="td_r">上海  崇明县</td>
%         <td class="td_r">上海市  浦东新区</td>
% 
%         <td class="td_l">业务电话</td>
%         <td class="td_r">021-39368198</td>
%     </tr>
%     <tr>
%         <td class="td_l">公司地址</td>
%         <td class="td_r">崇明县城桥镇三沙洪路338号 </td>
%     </tr>
%     <tr>
%         <td class="td_l">派送范围</td>
%         <td class="td_r">城东风农场长江农场新河镇庙镇 </td>
% <td class="td_r"><P>黄浦区,  卢湾区,  徐汇区, ,  奉贤区,  崇明县 <BR></td>  
ex_sub= {   '<td class="td_r">上海.*? (.*?)</td>'...
            '<td class="td_l">业务电话<.*?"td_r">(.*?)</td>'...
            '<td class="td_l">公司地址<.*?"td_r">(.*?)</td>'...
            '<td class="td_l">派送范围<.*?"td_r">(.*?)</td>'
         }
outExcel = 'C:\Users\Chen\Desktop\KuaiDi.xlsx';
% 指定序号
title_colum = {'所在地区','公司名称','地址','联系电话','派送范围','Link'};
sheet = '天天';
% 实际序号
%   1        2         3         4        5        6
%  Link   公司名称   所在地区  联系电话   地址   派送范围   
order = [3 2 5 4 6  1];
% http://www.ttkdex.com/webdistributed/webinfo/021211.html#title
% "webinfo/021220.html"
url_prefix = 'http://www.ttkdex.com/webdistributed/';
url_postfix='#title';
url_start =1;

%%
disp('读取主页信息...')
index=urlread(url_main,'get','','Charset','GB18030');
[tok ] = regexp(index, ex_main,'tokens');
informations = cat(1,[],tok{:}); % 记得{:}
%%
infor =[];
if ~isempty(ex_sub)
    disp('读取子网页...');
    infor = cell(size(tok,2),size(ex_sub,2));
    for i=1:size(tok,2)
        url = informations{i,1};
        url = [ url_prefix url(url_start:end) url_postfix];
        informations{i,1}=url;   % 1 url
        disp(['读取第' num2str(i) '个子网页' ':' url '...']);
        while(1)%不达目的，誓不罢休
            try
                [content status]=urlread(url,'get','','Charset','GB18030');
                if status  %也可以直接break
                    break;
                end
            catch exception
                pause(0.2);
            end
        end
        inf = regexp(content, ex_sub,'tokens');
        %    inf =cat(2,[],inf{:});  %空的时候会移位……
        %    inf = cat(2,[],inf{:});
        %    infor(i,:) = inf;
        for j=1:size(inf,2)
            if ~isempty(inf{j})
                infor(i,j)=inf{j}{:};
            end
        end
    end
end
%%
informations = [informations infor];
informations(:,1:end) = informations(:,order);
disp('写入Excel');
xlswrite(outExcel, title_colum,sheet,['A1:' char('A'+size(informations,2)-1) '1' ]);
xlswrite(outExcel, informations,sheet,['A2:' char('A'+size(informations,2)-1)  num2str(size(informations,1)+1) ]);

你需要修改的地方：outExcel = 'C:\Users\Chen\Desktop\KuaiDi.xlsx'; （改成你想要的路径就好了）

跑起来，是酱紫的

结果:就是前面的那个Excel,

什么？行距不一样，呃，这个用Excel设置一下就可以啦（全选->自动换行）

相关知识：

1）html这里你只需要知道怎么看网页源代码就可以了，右键->查看网页源代码

2）正则表达式详解非常详细，我基本上只用了，标记(tokens)，最为方便了

3）matlab打开网页(自己找下吧，我也不记得我参考了哪些资料了)

4）matlab cell（cell的操作比较复杂）

注意事项：

1）我用的matlab是2014,貌似13之前的版本不支持中文网页，要稍微修改下，其实也很简单，网上查一下

2）有些网页是utf-8的字符集，有些网页是GB18030

3）细心的读者应该看到提取的信息中还有一些HTMl标记，这个吗，我是直接在Excel里面查找，全部替换

4）上述代码是我模板化之后的，初始版本是酱紫的，更容易理解，不过效率更低，获取的是韵达快递的信息，上面看不懂的话，可以看这个

outExcel = 'C:\Users\Chen\Desktop\KuaiDi.xlsx';
%% 
disp('读取主页信息')
url ='http://www.chakd.com/yd/cs/chengshi021.htm';
% sourcefile=urlread(url,'get','');
sourcefile=urlread(url,'get','');
%     <td class="ydwd"><a href="../wd/ydwd199100.htm">199100</a></td>
%     <td class="ydwd"><a href="../wd/ydwd199100.htm">总公司业务K部</a></td>
ex1 = '<td class="ydwd"><a href="(.*?)">(\d{6})</a><.*?<td class="ydwd"><.*?>(.*?)</a></td>';
[tok match] = regexp(sourcefile, ex1,'tokens', 'match');

%%
disp('读取子网页');
for i=1:size(tok,2)
    %% 韵达
    url_prefix = 'http://www.chakd.com/yd';
    url = tok{1,i}{1};
    url = [ url_prefix url(3:end)];
    disp(['读取第' num2str(i) '个子网页' ':' url]);
    content=urlread(url,'get','','Charset','GB18030');   
%     content=urlread(url,'get','');    
    % charge
    ex_charge ='<td class="table_title">负责人<.*?<td class="table_content">(.*?)</td>';
    charge= regexp(content, ex_charge,'tokens');
    tok{1,i}{1}=charge;
    % telephone
    ex_phone ='<td class="table_title">联系电话<.*?<td class="table_content">(.*?)</td>';
    phone= regexp(content, ex_phone,'tokens');
    tok{1,i}{4}=phone;
    % address
    ex_addr ='<td class="table_title">地址<.*?<td class="table_content">(.*?)</td>';
    addr = regexp(content, ex_addr,'tokens');
    tok{1,i}{5}=addr;
    % range
    ex_range ='<td class="table_title">派送范围<.*?<td class="table_content">(.*?)</td>';
    range = regexp(content, ex_range,'tokens');
   % disp(range{:});
    tok{1,i}{6}=range;
end
%%
disp('写入Excel');
 xlswrite(output, tok{1,i},'韵达',['A' num2str(i) ':F' num2str(i)]);
for i=1:size(tok,2)    
    xlswrite(output, tok{1,i},'韵达',['A' num2str(i) ':F' num2str(i)]);
end

Matlab 提取网页信息保存到Excel（正则表达式）

猜你喜欢