Java Web Crawler Notes

HttpClient is used in place of a browser to send the request.
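
A minimal sketch of sending a request without a browser. It assumes the JDK's built-in java.net.http.HttpClient (Java 11+), which may not be the HttpClient library these notes had in mind, and https://example.com/ is only a placeholder URL.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class FetchPageSketch {

    // Builds a GET request with a timeout and an explicit User-Agent header.
    // The User-Agent value is a made-up example.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "Mozilla/5.0 (notes-crawler-sketch)")
                .GET()
                .build();
    }

    // Sends the request and returns the response body (the page's HTML) as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL, not one taken from these notes.
        System.out.println(fetch("https://example.com/").length());
    }
}
```

The returned HTML string is what would then be handed to a parser such as Jsoup.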

select() returns the matched elements as an Elements collection. To get the value of a specific attribute, use the attr("") method; to get the text inside a tag, use text().

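Selector overview
tagname: find elements by tag name, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: find elements that have the attribute, e.g. [href]
[^attr]: find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: find elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: find elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements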
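Combining selectors
el#id: element + ID, e.g. div#logo
el.class: element + class, e.g. div.masthead
el[attr]: element + attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: find child elements that descend from an ancestor, e.g. body p finds all p elements anywhere under the "body" element
parent > child: find direct children of a parent, e.g. div.content > p finds p elements, and body > * finds all direct children of the body tag
siblingA + siblingB: find the sibling B element that immediately follows sibling A, e.g. div.head + div
siblingA ~ siblingX: find sibling X elements that come after sibling A, e.g. h1 ~ p
el, el, el: group several selectors and find the unique elements that match any of them, e.g. div.masthead, div.logo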
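Pseudo selectors
:lt(n): find elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) finds the first three cells of each row
:gt(n): find elements whose sibling index is greater than n, e.g. div p:gt(2) finds p elements inside a div whose index is greater than 2
:eq(n): find elements whose sibling index equals n, e.g. form input:eq(1) finds input elements with index 1 inside a form
:has(selector): find elements that contain elements matching the selector, e.g. div:has(p) finds divs that contain p elements
:not(selector): find elements that do not match the selector, e.g. div:not(.logo) finds all divs that do not have class=logo
:contains(text): find elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the regular expression
Note: the pseudo selector indexes above are 0-based, so the first element has index 0, the second has index 1, and so on
See the Selector API reference for more detail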
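
To see what the pattern in the [attr~=regex] example accepts, the sketch below applies the same regular expression to a few made-up src values using only java.util.regex. It assumes the selector matches when the pattern is found anywhere in the attribute value; the file names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SrcRegexSketch {

    // The regex from the img[src~=...] example: a case-insensitive
    // ".png", ".jpg" or ".jpeg" anywhere in the value.
    static final Pattern IMAGE = Pattern.compile("(?i)\\.(png|jpe?g)");

    // Keeps only the src values in which the pattern is found.
    static List<String> imageSources(List<String> srcValues) {
        return srcValues.stream()
                .filter(src -> IMAGE.matcher(src).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical src values.
        System.out.println(imageSources(List.of("logo.PNG", "photo.jpeg", "anim.gif", "pic.jpg")));
    }
}
```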
String has a replace method that can be used to replace parts of a string.
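
A small sketch of cleaning scraped text with String.replace, next to replaceAll for comparison. The sample strings and method names are made up for illustration.

```java
class ReplaceSketch {

    // replace() treats both arguments as literal text and replaces every occurrence.
    static String stripCommas(String text) {
        return text.replace(",", "");
    }

    // replaceAll() interprets its first argument as a regular expression.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripCommas("1,299,000"));                  // 1299000
        System.out.println(collapseWhitespace("  Java \n  crawler ")); // Java crawler
    }
}
```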
    
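With Spring MVC, scheduled tasks are usually implemented with a job, Quartz, or Timer, but these are cumbersome to use and need a lot of configuration in XML files.

Scheduled tasks in Spring Boot are simpler.

1. Add the @EnableScheduling annotation to the application's startup class to turn on scheduled tasks. It acts as a switch: once it is on, the Spring Boot container scans the task classes, and each task it finds runs automatically at the time defined for it.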
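 @Entity marks the class (UserEntity) as an entity class. By default it maps to the database table user_entity. It can also be written as

      @Entity(name = "xwj_user")

      or

      @Entity
      @Table(name = "xwj_user", schema = "test")

      The @Entity annotation has a single attribute, name: the entity name, which is also used as the default table name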
      
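  @Table is needed when the entity class name differs from the name of the database table it maps to. It is used alongside @Entity, placed before the class declaration, either on its own line or on the same line as the declaration.
      The most commonly used attribute of @Table is name, which specifies the database table name
      @Table also has two attributes, catalog and schema, which set the database catalog or schema the table belongs to, usually the database name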
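
The default mapping from UserEntity to user_entity follows a camelCase-to-snake_case convention. The sketch below is a simplified illustration of that convention using a plain regex. It is not the naming-strategy code that Spring Boot or Hibernate runs, and it may differ from it on edge cases such as consecutive capitals.

```java
import java.util.Locale;

class TableNameSketch {

    // Puts "_" between a lower-case letter or digit and the upper-case letter
    // that follows it, then lower-cases everything.
    static String toSnakeCase(String className) {
        return className.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toSnakeCase("UserEntity")); // user_entity
    }
}
```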
 

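By default the class field names are used as the column names in the table, so the @Column annotation is needed to change the mapping between class field names and database column names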

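The @Column annotation defines which column of the table a member property maps to, along with that column's structure. Its attributes are:
  1) name: the mapped column name. For example, to map the name column of the tbl_user table, put the annotation on the name property or on the getName method;
  2) unique: whether the column is unique;
  3) nullable: whether null is allowed;
  4) length: for string columns, the maximum character length of the column;
  5) insertable: whether the column is included in inserts;
  6) updatable: whether the column is included in updates;
  7) columnDefinition: the DDL used for this column when the table is created;
  8) table: the secondary table name. If the column is not in the primary table (the default), this attribute names the secondary table that holds it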
  
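  For a primary key id, the @Id annotation is used as well

@Id marks the primary key of the table. The key can be generated in several ways (the strategy of @GeneratedValue):
  1) TABLE: the container uses an underlying database table to ensure uniqueness;
  2) SEQUENCE: uses a database SEQUENCE to ensure uniqueness (Oracle generates unique IDs through sequences);
  3) IDENTITY: uses a database IDENTITY column to ensure uniqueness;
  4) AUTO: the container picks a suitable strategy to ensure uniqueness;
  5) NONE: the container does not generate the key; the program assigns it (in JPA this means leaving out @GeneratedValue).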
 

Two other annotations are used together with @Id: @GeneratedValue and @GenericGenerator. For usage details, see the Hibernate material on @GeneratedValue and @GenericGenerator

For database write and delete operations, annotate the method with @Transactional.

PageProcessor is part of webmagic-core. A crawler implements its own logic by providing a custom PageProcessor.

DOM approach

Jsoup.parse() returns the DOM.

From the DOM, getElementsByClass, getElementById, and getElementsByTag return the corresponding elements.

An element has its own methods: id() gets the id, className() gets the class, text() gets the text, and attr() gets a specified attribute. attributes() gets all attributes at once. If there are several class names, they are all returned at once, separated by spaces.

Selector approach

Find elements by tag, such as span:

dom.select("span")

Find elements by ID: put a # in front.

Find elements by class: put a . in front.

Find by attribute: [attribute] or [attribute=value].

tag::text gets the text inside the tag.

tag::attr(attribute) gets an attribute value.

Or call css("selector", "attribute name") directly to get the attribute value.

Selectors can be combined: append another selector directly after the first one, with no space between them.

Find child elements under an element: similar to a combination, but with a space in between. Adding > finds direct children only and skips indirect ones. * can stand for all direct child elements...

@Scheduled(fixedDelay = number) declares a scheduled task. The time parameter goes in the parentheses, and number is in milliseconds.
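
fixedDelay measures the wait from the end of one run to the start of the next. The sketch below shows the same timing with the JDK's ScheduledExecutorService. It is a plain-Java analogy, not Spring's implementation, and the helper name runWithFixedDelay is made up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class FixedDelaySketch {

    // Runs task repeatedly; each run starts delayMillis after the previous run finished.
    // Returns true if the task completed at least `runs` times before the timeout.
    static boolean runWithFixedDelay(Runnable task, long delayMillis, int runs, long timeoutMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(runs);
        try {
            scheduler.scheduleWithFixedDelay(() -> {
                task.run();
                remaining.countDown();
            }, 0, delayMillis, TimeUnit.MILLISECONDS);
            return remaining.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean finished = runWithFixedDelay(() -> System.out.println("crawl tick"), 200, 3, 5_000);
        System.out.println(finished);
    }
}
```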

This is how the page-processing logic is written in webmagic:

page.getHtml().css(selector goes here).toString()

Cron expressions for scheduled tasks:

Expressions can be generated online here: http://cron.qqe2.com/

A cron expression is a string of 6 or 7 fields separated by 5 or 6 spaces. Each field has its own meaning. There are two syntaxes:

  (1) Seconds Minutes Hours DayofMonth Month DayofWeek Year

  (2) Seconds Minutes Hours DayofMonth Month DayofWeek

  First, the structure

  cron fields from left to right (separated by spaces): second, minute, hour, day of month, month, day of week, year
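
The left-to-right field order can be checked with a small sketch that splits an expression on whitespace and labels each part. The class and method names are made up, and it only checks the field count, not the contents of each field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CronFieldsSketch {

    static final String[] NAMES = {
        "Seconds", "Minutes", "Hours", "DayofMonth", "Month", "DayofWeek", "Year"
    };

    // Splits a 6- or 7-field cron expression and labels each field in order.
    static Map<String, String> split(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("0 15 10 ? * MON-FRI"));
        System.out.println(split("0 15 10 * * ? 2005"));
    }
}
```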

  Second, the meaning of each field

Field            Allowed values                                     Allowed special characters
Seconds          integers 0-59                                      , - * /
Minutes          integers 0-59                                      , - * /
Hours            integers 0-23                                      , - * /
DayofMonth       integers 1-31 (mind the days in the given month)   , - * ? / L W C
Month            integers 1-12 or JAN-DEC                           , - * /
DayofWeek        integers 1-7 or SUN-SAT (1 = SUN)                  , - * ? / L C #
Year (optional)  blank or 1970-2099                                 , - * /

  Notes:

  Each field takes numbers, but the following special characters can also appear. Their meanings are:

  (1) *: matches any value of the field. Used in the Minutes field, it means the event fires every minute.

  (2) ?: can only be used in the DayofMonth and DayofWeek fields. It also matches any value of the field, but in practice it cannot be swapped for *, because DayofMonth and DayofWeek affect each other. For example, to fire on the 20th of every month regardless of which day of the week the 20th falls on, the expression must be written as 13 13 15 20 * ?. The last field can only be ?, not *: a * there would mean firing on every day of the week, which is not what is intended.

  (3) -: a range. For example, 5-20 in the Minutes field fires once every minute from minute 5 through minute 20.

  (4) /: fires starting at a given time and then at a fixed interval. For example, 5/20 in the Minutes field fires at minute 5 and then every 20 minutes, i.e. at minutes 5, 25, and 45.

  (5) ,: a list of values. For example, 5,20 in the Minutes field fires at minute 5 and at minute 20.

  (6) L: last. It can only appear in the DayofWeek and DayofMonth fields. 5L in the DayofWeek field fires on the last Thursday.

  (7) W: a working day (Monday to Friday). It can only appear in the DayofMonth field, and the event fires on the working day nearest to the specified date. For example, with 5W in DayofMonth: if the 5th is a Saturday, it fires on the nearest working day, Friday the 4th; if the 5th is a Sunday, it fires on Monday the 6th; if the 5th falls on a day from Monday to Friday, it fires on the 5th itself. Also, the search for the nearest working day does not cross into another month.

  (8) LW: the two characters can be combined to mean the last working day of the month.

  (9) #: the nth given weekday of the month. It can only appear in the DayofWeek field. For example, 4#2 means the second Wednesday of the month.
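
The numeric characters above (*, -, / and ,) can be illustrated by expanding a single field into the values it matches. This is a simplified sketch for illustration only: it does not handle ?, L, W, #, or names such as MON, and it is not how Spring or Quartz parse expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class CronExpandSketch {

    // Expands one numeric cron field into the sorted values it matches.
    // Handles "*", "a", "a-b", "a/step", "*/step", "a-b/step" and comma lists.
    // min and max are the bounds of the field (0 and 59 for Minutes).
    static List<Integer> expand(String field, int min, int max) {
        TreeSet<Integer> values = new TreeSet<>();
        for (String part : field.split(",")) {
            String[] rangeAndStep = part.split("/");
            boolean hasStep = rangeAndStep.length == 2;
            int step = hasStep ? Integer.parseInt(rangeAndStep[1]) : 1;
            if (step <= 0) {
                throw new IllegalArgumentException("step must be positive: " + part);
            }
            String range = rangeAndStep[0];
            int start;
            int end;
            if (range.equals("*")) {
                start = min;
                end = max;
            } else if (range.contains("-")) {
                String[] bounds = range.split("-");
                start = Integer.parseInt(bounds[0]);
                end = Integer.parseInt(bounds[1]);
            } else {
                start = Integer.parseInt(range);
                // "5/20" starts at 5 and keeps stepping to the end of the field.
                end = hasStep ? max : start;
            }
            for (int value = start; value <= end; value += step) {
                values.add(value);
            }
        }
        return new ArrayList<>(values);
    }

    public static void main(String[] args) {
        System.out.println(expand("5/20", 0, 59)); // [5, 25, 45]
        System.out.println(expand("5,20", 0, 59)); // [5, 20]
        System.out.println(expand("5-20", 0, 59).size()); // 16
    }
}
```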

  Third, common expression examples

  (1) 0 0 2 1 * ? * runs the task at 2:00 am on the 1st of every month

  (2) 0 15 10 ? * MON-FRI runs the job at 10:15 am every day from Monday to Friday

  (3) 0 15 10 ? * 6L 2002-2006 runs at 10:15 am on the last Friday of every month from 2002 to 2006

  (4) 0 0 10,14,16 * * ? every day at 10 am, 2 pm, and 4 pm

  (5) 0 0/30 9-17 * * ? every half hour during nine-to-five working hours

  (6) 0 0 12 ? * WED every Wednesday at 12 noon

  (7) 0 0 12 * * ? fires at 12 noon every day

  (8) 0 15 10 ? * * fires at 10:15 am every day

  (9) 0 15 10 * * ? fires at 10:15 am every day

  (10) 0 15 10 * * ? * fires at 10:15 am every day

  (11) 0 15 10 * * ? 2005 fires at 10:15 am every day in 2005

  (12) 0 * 14 * * ? fires every minute from 2:00 pm to 2:59 pm every day

  (13) 0 0/5 14 * * ? fires every 5 minutes from 2:00 pm to 2:55 pm every day

  (14) 0 0/5 14,18 * * ? fires every 5 minutes from 2:00 pm to 2:55 pm and from 6:00 pm to 6:55 pm every day

  (15) 0 0-5 14 * * ? fires every minute from 2:00 pm to 2:05 pm every day

  (16) 0 10,44 14 ? 3 WED fires at 2:10 pm and 2:44 pm on every Wednesday in March

  (17) 0 15 10 ? * MON-FRI fires at 10:15 am from Monday to Friday

  (18) 0 15 10 15 * ? fires at 10:15 am on the 15th of every month

  (19) 0 15 10 L * ? fires at 10:15 am on the last day of every month

  (20) 0 15 10 ? * 6L fires at 10:15 am on the last Friday of every month

  (21) 0 15 10 ? * 6L 2002-2005 fires at 10:15 am on the last Friday of every month from 2002 to 2005

  (22) 0 15 10 ? * 6#3 fires at 10:15 am on the third Friday of every month

  

  Note:

  (1) Some subexpressions can contain a range or a list

  For example, the day-of-week subexpression can be "MON-FRI", "MON,WED,FRI", or "MON-WED,SAT"

  The "*" character stands for all possible values

  So "*" in the month subexpression means every month, and "*" in the day-of-week subexpression means every day of the week

  The "/" character specifies an increment
  For example, "0/15" in the minutes subexpression means every 15 minutes starting at minute 0
  "3/20" in the minutes subexpression means every 20 minutes starting at minute 3 (the same as "3,23,43")

  The "?" character is used only in the day-of-month and day-of-week subexpressions and means no specific value
  Once one of the two subexpressions has been given a value, the other must be set to "?" to avoid a conflict

  The "L" character is used only in the day-of-month and day-of-week subexpressions. It is short for "last",
  but its meaning differs between the two.
  In the day-of-month subexpression, "L" means the last day of the month
  In the day-of-week subexpression, "L" means the last day of the week, which is SAT

  If "L" is preceded by a value, it takes on a different meaning

  For example, in the day-of-week subexpression "6L" means the last Friday of the month, and so does "FRIL"
  Note: when using "L", do not specify a list or range, as this causes problems

In XPath, @ used outside square brackets selects the value of an attribute; @ used inside square brackets selects the tags that have that attribute.

// matches descendant nodes at any depth.

/ matches direct children only.
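
The difference between //, / and the two uses of @ can be tried out with the XPath engine that ships with the JDK (javax.xml.xpath). This is only a sketch on a made-up XML snippet; webmagic uses its own XPath support, which may behave differently on real HTML.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Made-up document: two direct <a> children of <div>, plus one nested inside <p>.
    static final String XML =
            "<div><a href='/top'>top</a><p><a href='/nested'>nested</a></p><a>no link</a></div>";

    // Returns how many nodes the expression selects.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    // Returns the string value of the first node the expression selects.
    static String value(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//a"));          // 3: <a> at any depth
        System.out.println(count("/div/a"));       // 2: direct children only
        System.out.println(count("//a[@href]"));   // 2: tags that have an href
        System.out.println(value("/div/a/@href")); // /top: the attribute's value
    }
}
```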

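cron expression
The fields, in order, are
Seconds (0~59)
Minutes (0~59)
Hours (0~23)
Day of month (1~31, but mind the number of days in the month)
Month (1~12)
Day of week (1~7, 1=SUN, or SUN,MON,TUE,WED,THU,FRI,SAT)
Year (1970-2099)

Each element can be a single value (such as 6), a continuous range (9-12), an interval (8-18/4, where / means every 4 hours), a list (1,3,5), or a wildcard. Because the day-of-month and day-of-week elements are mutually exclusive, one of them must be set to ?.
0 0 10,14,16 * * ? every day at 10 am, 2 pm, and 4 pm
0 0/30 9-17 * * ? every half hour during nine-to-five working hours
0 0 12 ? * WED every Wednesday at 12 noon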
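Some subexpressions can contain a range or a list
For example, the day-of-week subexpression can be "MON-FRI", "MON,WED,FRI", or "MON-WED,SAT"
The "*" character stands for all possible values
So "*" in the month subexpression means every month, and "*" in the day-of-week subexpression means every day of the week

The "/" character specifies an increment
For example, "0/15" in the minutes subexpression means every 15 minutes starting at minute 0
         "3/20" in the minutes subexpression means every 20 minutes starting at minute 3 (the same as "3,23,43")

The "?" character is used only in the day-of-month and day-of-week subexpressions and means no specific value
Once one of the two subexpressions has been given a value, the other must be set to "?" to avoid a conflict

The "L" character is used only in the day-of-month and day-of-week subexpressions. It is short for "last",
but its meaning differs between the two.
In the day-of-month subexpression, "L" means the last day of the month
In the day-of-week subexpression, "L" means the last day of the week, which is SAT
If "L" is preceded by a value, it takes on a different meaning
For example, in the day-of-week subexpression "6L" means the last Friday of the month, and so does "FRIL"
Note: when using "L", do not specify a list or range, as this causes problems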

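Appendix: cronExpression configuration reference

Field            Allowed values        Allowed special characters
Seconds          0-59                  , - * /
Minutes          0-59                  , - * /
Hours            0-23                  , - * /
Day of month     1-31                  , - * ? / L W C
Month            1-12 or JAN-DEC       , - * /
Day of week      1-7 or SUN-SAT        , - * ? / L C #
Year (optional)  blank, 1970-2099      , - * /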
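Expression                    Meaning
"0 0 12 * * ?"                fires at 12 noon every day
"0 15 10 ? * *"               fires at 10:15 am every day
"0 15 10 * * ?"               fires at 10:15 am every day
"0 15 10 * * ? *"             fires at 10:15 am every day
"0 15 10 * * ? 2005"          fires at 10:15 am every day in 2005
"0 * 14 * * ?"                fires every minute from 2:00 pm to 2:59 pm every day
"0 0/5 14 * * ?"              fires every 5 minutes from 2:00 pm to 2:55 pm every day
"0 0/5 14,18 * * ?"           fires every 5 minutes from 2:00 pm to 2:55 pm and from 6:00 pm to 6:55 pm every day
"0 0-5 14 * * ?"              fires every minute from 2:00 pm to 2:05 pm every day
"0 10,44 14 ? 3 WED"          fires at 2:10 pm and 2:44 pm on every Wednesday in March
"0 15 10 ? * MON-FRI"         fires at 10:15 am from Monday to Friday
"0 15 10 15 * ?"              fires at 10:15 am on the 15th of every month
"0 15 10 L * ?"               fires at 10:15 am on the last day of every month
"0 15 10 ? * 6L"              fires at 10:15 am on the last Friday of every month
"0 15 10 ? * 6L 2002-2005"    fires at 10:15 am on the last Friday of every month from 2002 to 2005
"0 15 10 ? * 6#3"             fires at 10:15 am on the third Friday of every month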
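Special character    Meaning
* all values;
? no specific value, i.e. the value does not matter;
- a specified range;
, an additional possible value;
/ the value before the symbol is the start time, the value after it is the increment for each step;
L ("last"): in the day-of-month field, "L" means "the last day of the month"; in the day-of-week field on its own, it simply means "7" or "SAT". Combined with a number in the day-of-week field, it means "the last such weekday of the month", e.g. "6L" means "the last Friday of the month". When using "L", it is important not to specify a list or range of values, otherwise the results will be unexpected.
W ("weekday"): can only be used in the day-of-month field. It describes the working day (Monday to Friday) nearest to the given day. For example, "15W" in the day-of-month field means "the working day nearest to the 15th of the month": if the 15th is a Saturday, the trigger fires on Friday the 14th; if the 15th is a Sunday, it fires on Monday the 16th; if the 15th is a Tuesday, it fires on that day. Note that the value is computed within the current month only and never crosses into another month. "W" can only be applied to a single day in the day-of-month field, not to a range or list. "LW" can also be used to specify the last working day of the month.
# can only be used in the day-of-week field. It specifies the nth given weekday of the month. For example, "6#3" in the day-of-week field means the third Friday of the month (6 is Friday, 3 is the third). If the specified date does not exist, the trigger does not fire.
C a value computed against the associated calendar. For example, "5C" in the day-of-month field means the first day included by the calendar on or after the 5th of the month; "1C" in the day-of-week field means the first day included by the calendar on or after Sunday.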
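
The dates that L, 6L, 6#3 and 15W pick out can be worked through with java.time. The sketch below is an illustration of the rules described above, not the scheduler's own code; the method names are made up, and the month-boundary handling in nearestWeekday is an assumption based on the "does not cross the month" rule.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.TemporalAdjusters;

class CronCalendarSketch {

    // "L" in day-of-month: the last day of the month.
    static LocalDate lastDay(YearMonth month) {
        return month.atEndOfMonth();
    }

    // "6L" in day-of-week: the last Friday of the month.
    static LocalDate lastFriday(YearMonth month) {
        return month.atDay(1).with(TemporalAdjusters.lastInMonth(DayOfWeek.FRIDAY));
    }

    // "6#3" in day-of-week: the third Friday of the month.
    static LocalDate thirdFriday(YearMonth month) {
        return month.atDay(1).with(TemporalAdjusters.dayOfWeekInMonth(3, DayOfWeek.FRIDAY));
    }

    // "15W"-style lookup: the working day nearest to the given day, staying inside the month.
    static LocalDate nearestWeekday(YearMonth month, int day) {
        LocalDate date = month.atDay(day);
        switch (date.getDayOfWeek()) {
            case SATURDAY:
                // Normally back to Friday; on the 1st, forward to Monday instead.
                return day == 1 ? date.plusDays(2) : date.minusDays(1);
            case SUNDAY:
                // Normally forward to Monday; on the last day, back to Friday instead.
                return day == month.lengthOfMonth() ? date.minusDays(2) : date.plusDays(1);
            default:
                return date;
        }
    }

    public static void main(String[] args) {
        YearMonth november = YearMonth.of(2019, 11);
        System.out.println(lastDay(november));                        // 2019-11-30
        System.out.println(lastFriday(november));                     // 2019-11-29
        System.out.println(thirdFriday(november));                    // 2019-11-15
        System.out.println(nearestWeekday(YearMonth.of(2019, 6), 15)); // 2019-06-14
    }
}
```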

Overall sequence for a Java (MyBatis) crawler:

  1. Turn on scheduled tasks in the Application class.

  2. Write a task class that starts the crawler (set the schedule, the processor, the pipeline, and the URL).

  3. Write the processor: implement the process method, select the data with CSS, Jsoup, or XPath selectors, and hand the data to the pipeline.

  4. Write the pipeline: inject the Mapper, take the data out of ResultItems, and call the Mapper's methods to update the data and store it in the database. (Alternatively, write a service interface and store the data through the service, which in turn calls the mapper's methods.)

  5. Write the Mapper. Inside the mapper, write the CRUD SQL with annotations.

  6. Write the POJO. The mapper needs the POJO as a parameter.

  7. For items 4, 5, and 6, Spring Data JPA can be used in place of the mapper.
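
The hand-off between processor and pipeline in steps 3 and 4 can be sketched in plain Java. Everything here is a hypothetical stand-in: the functional interfaces replace webmagic's PageProcessor and Pipeline, the list replaces the Mapper and database, and the page strings are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

class CrawlFlowSketch {

    // Steps 3 and 4 in miniature: the processor extracts fields from each page,
    // and the pipeline receives those fields and stores them.
    static void crawl(List<String> pages,
                      Function<String, Map<String, String>> processor,
                      Consumer<Map<String, String>> pipeline) {
        for (String page : pages) {
            Map<String, String> resultItems = processor.apply(page);
            if (!resultItems.isEmpty()) {
                pipeline.accept(resultItems);
            }
        }
    }

    // Hypothetical processor: treats the trimmed page text as a title field.
    static Map<String, String> extractTitle(String page) {
        Map<String, String> items = new LinkedHashMap<>();
        String title = page.trim();
        if (!title.isEmpty()) {
            items.put("title", title);
        }
        return items;
    }

    public static void main(String[] args) {
        List<Map<String, String>> database = new ArrayList<>(); // stands in for the Mapper
        crawl(List.of(" Java developer ", "", "Crawler engineer"),
              CrawlFlowSketch::extractTitle, database::add);
        System.out.println(database);
    }
}
```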

webmagic first adds each page to the pipeline; at the end, everything is saved to the database together.

The text obtained for a tag does not include the text of its child tags.

Origin www.cnblogs.com/shiguangqingqingchui/p/11922738.html