Reading a page's source with Delphi and extracting strings from it

When people hear "web page collection" (scraping), they usually think of lifting data from other sites and republishing it on their own. In fact, collected data can also serve as reference material for a company, or be compared against the company's own business data.
Most page collectors today are written in one of the "3P" languages (ASP, PHP, JSP). The most representative examples, a technology company's BBS news collection system and the Sina news collection system circulating on the Internet, are both ASP programs, and in theory ASP is not very fast. Would a multithreaded collector written with other tools be faster? The answer is yes. Delphi, VC, VB, and JB can all do this, and PB is another possibility. The rest of this article uses Delphi to explain how to collect page data.
Simple news collection
News collection is the simplest case: you only need to identify the title, subtitle, author, source, date, body text, and pagination. Before collecting anything you must fetch the page content, so add an idHTTP control (on the Indy Clients palette page) to the Delphi form and fetch the page with the idHTTP1.Get method, which is declared as follows:
function Get(AURL: string): string; overload;
The AURL parameter is a string that specifies the URL. The function returns a string containing the page's HTML source. For example, we can call:
tmpStr := idHTTP1.Get('http://www.163.com');
After a successful call, the tmpStr variable holds the HTML of the NetEase home page.
Next comes extracting the data. I define the following function:
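function TForm1.GetStr(strSource, strBegin, strEnd: string): string;
var
  in_star, in_end: Integer;
begin
  // Position of the first character after the first occurrence of strBegin
  in_star := AnsiPos(strBegin, strSource) + Length(strBegin);
  // Position of the first occurrence of strEnd, searched from the start of strSource
  in_end := AnsiPos(strEnd, strSource);
  Result := Copy(strSource, in_star, in_end - in_star);
end;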
strSource: string; the HTML source to search.
strBegin: string; the marker that precedes the text to extract.
strEnd: string; the marker that follows the text to extract.
The function returns the text in strSource that lies between strBegin and strEnd.
For example, from inside a TForm1 method:
strTmp := GetStr('A123BCD', 'A', 'BC');
After this runs, strTmp is '123'.
AnsiPos and Copy, which the function uses, are system-defined routines. The Delphi help file documents them, so I will describe them only briefly here:
function AnsiPos(const Substr, S: string): Integer;
Returns the position of the first occurrence of Substr in S, or 0 if Substr does not occur in S.
function Copy(S: string; Index, Count: Integer): string;
Returns Count characters of S starting at position Index. In GetStr, the call Copy(strSource, in_star, in_end - in_star) returns the in_end - in_star characters of strSource that start at in_star.
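GetStr as written assumes that both markers are present and that the end marker first appears after the start marker. The following is a minimal sketch of a more defensive variant, not the article's own function: ExtractBetween is a hypothetical name, and it uses Pos and StrUtils.PosEx so that the end marker is searched for only after the start marker. It is written as a console program that should build with Delphi 7 or later, or with Free Pascal in Delphi mode. Note that PosEx is not multi-byte aware in pre-Unicode Delphi, which may matter for GB2312 pages.

```pascal
program ExtractBetweenDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  StrUtils;

// Hypothetical helper (not from the article): returns the text between
// ABegin and the first AEnd that follows it, or '' if either is missing.
function ExtractBetween(const ASource, ABegin, AEnd: string): string;
var
  StartPos, EndPos: Integer;
begin
  Result := '';
  StartPos := Pos(ABegin, ASource);
  if StartPos = 0 then
    Exit; // start marker not found
  StartPos := StartPos + Length(ABegin);
  // Search for the end marker only after the start marker.
  EndPos := PosEx(AEnd, ASource, StartPos);
  if EndPos = 0 then
    Exit; // end marker not found
  Result := Copy(ASource, StartPos, EndPos - StartPos);
end;

begin
  // Same input as the article's example.
  WriteLn(ExtractBetween('A123BCD', 'A', 'BC'));               // 123
  // The end marker also occurs before the start marker.
  WriteLn(ExtractBetween('</b>x<b>title</b>', '<b>', '</b>')); // title
  // Missing start marker yields an empty string.
  WriteLn('[', ExtractBetween('A123BCD', 'Z', 'BC'), ']');     // []
end.
```

With the second input, the article's GetStr would find the end marker at position 1 and return an empty string, which is the case this variant is meant to cover.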
With these functions we can set up various markers to extract whatever parts of an article we want. The tedious part is that each item needs its own pair of markers, one for the start and one for the end. For example, to get the article title, first view the page source, note the distinctive code immediately before and after the title, and use those two snippets as the markers.
Let's look at a concrete example. Suppose the article to collect is at http://www.xxx.com/test.htm
and its source is:
<html>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta name="GENERATOR" content="Microsoft FrontPage 5.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>New Page 1</title>
</head>
<body>
<p align="center"><b>article title</b></p>
<table border="
<tr><td width="60%">author</td>
<td width="40%">source</td></tr>
</table>
<p><font size="2">here is the text of the article content.</font></p>
<a href='..new_pr.asp'>Previous</a> <a href='new_ne.asp'>Next</a>
</body>
</html>
First, call strSource := idHTTP1.Get('http://www.xxx.com/test.htm'); to store the page source in the strSource variable.
Then fill in strTitle, strAuthor, strCopyFrom, and strContent:
strTitle := GetStr(strSource, '<p align="center"><b>', '</b></p>');
strAuthor := GetStr(strSource, '<tr><td width="60%">', '</td>');
strCopyFrom := GetStr(strSource, '<td width="40%">', '</td></tr>');
strContent := GetStr(strSource, '<p><font size="2">', '</font></p>');
In this way, the article's title, subtitle, author, source, date, and body text can each be stored in its own variable.
Second, loop: open the next page, extract its content, and append it to the strContent variable.
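// Get needs an absolute URL, so resolve the relative link new_ne.asp against the site first
strSource := idHTTP1.Get('http://www.xxx.com/new_ne.asp');
strContent := strContent + GetStr(strSource, '<p><font size="2">', '</font></p>');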
Then check whether this page has a further Next link; if it does, fetch that page as well.
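The paging loop described above can be sketched as follows. This is an illustration under stated assumptions rather than tested collector code: FetchPage is a hypothetical stand-in for idHTTP1.Get that serves two invented in-memory pages so the program runs offline, Between and NextLink are hypothetical helpers, and MaxPages is an assumed safety cap. In a real collector, each href would also have to be resolved to an absolute URL before being fetched.

```pascal
program FollowNextDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  StrUtils;

const
  MaxPages = 50; // assumed safety cap in case Next links form a cycle

// Stand-in for idHTTP1.Get so the sketch runs offline. The page
// contents are invented for illustration.
function FetchPage(const AName: string): string;
begin
  if AName = 'test.htm' then
    Result := '<p><font size="2">Part one. </font></p>' +
      '<a href=''new_ne.asp''>Next</a>'
  else if AName = 'new_ne.asp' then
    Result := '<p><font size="2">Part two.</font></p>'
  else
    Result := '';
end;

// Text between ABegin and the first AEnd after it, or '' if not found.
function Between(const S, ABegin, AEnd: string): string;
var
  StartPos, EndPos: Integer;
begin
  Result := '';
  StartPos := Pos(ABegin, S);
  if StartPos = 0 then
    Exit;
  StartPos := StartPos + Length(ABegin);
  EndPos := PosEx(AEnd, S, StartPos);
  if EndPos > 0 then
    Result := Copy(S, StartPos, EndPos - StartPos);
end;

// href of the link whose text is "Next", or '' when the page has none.
function NextLink(const APage: string): string;
var
  QuoteEnd, QuoteStart: Integer;
begin
  Result := '';
  QuoteEnd := Pos('''>Next</a>', APage); // closing quote of the href
  if QuoteEnd = 0 then
    Exit;
  QuoteStart := QuoteEnd - 1;
  while (QuoteStart > 0) and (APage[QuoteStart] <> '''') do
    Dec(QuoteStart); // walk back to the opening quote
  if QuoteStart > 0 then
    Result := Copy(APage, QuoteStart + 1, QuoteEnd - QuoteStart - 1);
end;

var
  PageName, Page, Content: string;
  Count: Integer;
begin
  Content := '';
  PageName := 'test.htm';
  Count := 0;
  // Keep fetching while the current page advertises a Next link.
  while (PageName <> '') and (Count < MaxPages) do
  begin
    Page := FetchPage(PageName);
    Content := Content + Between(Page, '<p><font size="2">', '</font></p>');
    PageName := NextLink(Page);
    Inc(Count);
  end;
  WriteLn(Content); // Part one. Part two.
end.
```

The page cap is a design choice: without it, two pages that link to each other as Next would keep the loop running indefinitely.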
This completes a simple extraction process. As the code shows, the method works by locating the text immediately before and immediately after the content to extract. What if the start or end marker occurs more than once? This approach finds only the first occurrence, so before relying on a pair of markers, verify that each of them occurs only once around the content you want.
The snippets in this walkthrough have not been verified; they are offered for reference, so try them if you find them useful.
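On the question of markers that occur more than once, one possible approach is to collect every match instead of only the first. The sketch below assumes a hypothetical ExtractAll procedure (not part of the original article) that scans left to right with StrUtils.PosEx and adds each piece of text found between a start marker and the next end marker to a TStrings list.

```pascal
program ExtractAllDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes, StrUtils;

// Hypothetical helper: appends to AList every piece of text found between
// an ABegin marker and the next AEnd marker, scanning left to right.
procedure ExtractAll(const ASource, ABegin, AEnd: string; AList: TStrings);
var
  StartPos, EndPos: Integer;
begin
  StartPos := PosEx(ABegin, ASource, 1);
  while StartPos > 0 do
  begin
    StartPos := StartPos + Length(ABegin);
    EndPos := PosEx(AEnd, ASource, StartPos);
    if EndPos = 0 then
      Break; // unterminated match: stop scanning
    AList.Add(Copy(ASource, StartPos, EndPos - StartPos));
    // Continue after the end marker that was just consumed.
    StartPos := PosEx(ABegin, ASource, EndPos + Length(AEnd));
  end;
end;

var
  Cells: TStringList;
  I: Integer;
begin
  Cells := TStringList.Create;
  try
    ExtractAll('<td>author</td><td>source</td>', '<td>', '</td>', Cells);
    for I := 0 to Cells.Count - 1 do
      WriteLn(I, ': ', Cells[I]);
  finally
    Cells.Free;
  end;
end.
```

Run against the two-cell input shown, the list ends up holding author and source in that order, and the caller can then pick the item it needs by index.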
///////////////////////////////////////
Downloading a page with Delphi
http://dev.csdn.net/develop/article/61/61609.shtm
Create a new project and place a TIdHTTP control, a TIdAntiFreeze control, and a TProgressBar (to display download progress) on the form. Finally, add a TButton to start the download. The code is as follows:
procedure TForm1.Button2Click(Sender: TObject);
var
  MyStream: TMemoryStream;
begin
  IdAntiFreeze1.OnlyWhenIdle := False; // keep the program responsive during the download
  MyStream := TMemoryStream.Create;
  try
    // Download a ZIP file from my site
    IdHTTP1.Get('http://www.138soft.com/download/Mp3ToExe.zip', MyStream);
  except // Indy controls are generally used inside this kind of try..except structure
    ShowMessage('Network error!');
    MyStream.Free;
    Exit;
  end;
  MyStream.SaveToFile('C:\Mp3ToExe.zip');
  MyStream.Free;
  ShowMessage('OK');
end;

procedure TForm1.IdHTTP1WorkBegin(Sender: TObject; AWorkMode: TWorkMode;
  const AWorkCountMax: Integer);
begin
  // Size the progress bar to the expected amount of work
  ProgressBar1.Max := AWorkCountMax;
  ProgressBar1.Min := 0;
  ProgressBar1.Position := 0;
end;
Another form of IdHTTP1.Get returns the content as a string. For example, the procedure above can be rewritten as:
procedure TForm1.Button1Click(Sender: TObject);
var
  MyStr: string;
begin
  IdAntiFreeze1.OnlyWhenIdle := False; // keep the program responsive during the request
  try
    MyStr := IdHTTP1.Get('http://www.138soft.com/default.htm');
  except
    ShowMessage('Network error!');
    Exit;
  end;
  ShowMessage(MyStr);
end;

Origin www.cnblogs.com/blogpro/p/11346002.html