Black reptile science and technology, how I crawled indeed jobs data

A recent study nodejs crawler technology, learning a request module, so thinking about writing a reptile own projects, research for a long time, indeed as the final selection of the target site, take indeed the position data by crawling, and then develop a own job search engine , it is now on the line, although the function is relatively simple, but still posted about URLs the Job Search Engine , reptile about this project is to prove useful. Here's to say something about the whole idea of reptiles.

Determine entry page

As we all know, reptiles need entry page by entry page, constantly crawling links, and finally crawling the full site. In this first step, encountered difficulties in general are selected and a list of home page as the entry page, but the page did indeed list restrictions, can not crawl a complete list of, at most, can only crawl 100 pages, but it did not bother me, I found that indeed there is a Browse Jobs page, through this page, you can indeed get a list of all search by region and by type of search. Below posted about parsing code for this page.

start: async (page) => {
  const host = URL.parse(page.url).hostname;
  const tasks = [];
  try {
    const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
    $('#states > tbody > tr > td > a').each((i, ele) => {
      const url = URL.resolve(page.url, $(ele).attr('href'));
      tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
    });
    $('#categories > tbody > tr > td > a').each((i, ele) => {
      const url = URL.resolve(page.url, $(ele).attr('href'));
      tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
    });
    const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
    res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
    return 1;
  } catch (err) {
    console.error(`${host}-start parse ${page.url} ${err}`);
    return 0;
  }
}

By cheerio parse html content, the search by region and by type search links inserted into the database.

Reptile architecture

I'm simply here to talk about the idea of reptiles architecture, database selection mongodb. Each of memory pages to be crawled Page a record, comprising id, url, done, type, host and other fields, id used md5(url)to generate, to avoid duplication. Each type has a corresponding html content analysis method, the main business logic which are concentrated in these analytical methods, the code is posted above example.

Html crawling use request module, a simple package, packaged into the callback Promise, easy to use and async call await mode, as follows.

const req = require('request');

const request = req.defaults({
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
  },
  timeout: 30000,
  encoding: null
});

const fetch = (url) => new Promise((resolve) => {
    console.log(`down ${url} started`);
    request(encodeURI(url), (err, res, body) => {
      if (res && res.statusCode === 200) {
        console.log(`down ${url} 200`);
        resolve(body);
      } else {
        console.error(`down ${url} ${res && res.statusCode} ${err}`);
        if (res && res.statusCode) {
          resolve(res.statusCode);
        } else {
          // ESOCKETTIMEOUT 超时错误返回600
          resolve(600);
        }
      }
    });
  });

Do a simple anti-anti-climbing process, the user-agent computer into a common user-agent, a timeout time of 30 seconds, in which the encoding: nullcontent setting request directly back buffer, rather than parse later, this benefit is if the page is gbk or utf-8 encoding, as long as the time specified parsing html coding on the line, if this is specified encoding: utf-8, then when the page code is gbk when the page content will be garbled.

The default is to request a callback function form, by promise package, if successful, the page content returned buffer, if it fails, the error status code is returned, if time-out, 600 returned understand nodejs of these should be well understood.

Complete parsing code

const URL = require('url');
const md5 = require('md5');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');

const json = (data) => {
  let res;
  try {
    res = JSON.parse(data);
  } catch (err) {
    console.error(err);
   }
  return res;
};

const rules = [
  /\/jobs\?q=.*&sort=date&start=\d+/,
  /\/jobs\?q=&l=.*&sort=date&start=\d+/
];

const fns = {

  start: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    try {
      const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
      $('#states > tbody > tr > td > a').each((i, ele) => {
        const url = URL.resolve(page.url, $(ele).attr('href'));
        tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
      });
      $('#categories > tbody > tr > td > a').each((i, ele) => {
        const url = URL.resolve(page.url, $(ele).attr('href'));
        tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
      });
      const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-start parse ${page.url} ${err}`);
      return 0;
    }
  },

  city: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const cities = [];
    try {
      const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
      $('#cities > tbody > tr > td > p.city > a').each((i, ele) => {
        // https://www.indeed.com/l-Charlotte,-NC-jobs.html
        let tmp = $(ele).attr('href').match(/l-(?<loc>.*)-jobs.html/u);
        if (!tmp) {
          tmp = $(ele).attr('href').match(/l=(?<loc>.*)/u);
        }
        const { loc } = tmp.groups;
        const url = `https://www.indeed.com/jobs?l=${decodeURIComponent(loc)}&sort=date`;
        tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
        cities.push({ _id: `${$(ele).text()}_${page.name}`, parent: page.name, name: $(ele).text(), url });
      });
      let res = await global.com.city.insertMany(cities, { ordered: false }).catch(() => {});
      res && console.log(`${host}-city insert ${res.insertedCount} from ${cities.length} cities`);

      res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-city insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-city parse ${page.url} ${err}`);
      return 0;
    }
  },

  category: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const categories = [];
    try {
      const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
      $('#titles > tbody > tr > td > p.job > a').each((i, ele) => {
        const { query } = $(ele).attr('href').match(/q-(?<query>.*)-jobs.html/u).groups;
        const url = `https://www.indeed.com/jobs?q=${decodeURIComponent(query)}&sort=date`;
        tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
        categories.push({ _id: `${$(ele).text()}_${page.name}`, parent: page.name, name: $(ele).text(), url });
      });
      let res = await global.com.category.insertMany(categories, { ordered: false }).catch(() => {});
      res && console.log(`${host}-category insert ${res.insertedCount} from ${categories.length} categories`);

      res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-category insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-category parse ${page.url} ${err}`);
      return 0;
    }
  },

  search: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const durls = [];
    try {
      const con = iconv.decode(page.con, 'utf-8');
      const $ = cheerio.load(con, { decodeEntities: false });
      const list = con.match(/jobmap\[\d+\]= {.*}/g);
      const jobmap = [];
      if (list) {
         // eslint-disable-next-line no-eval
        list.map((item) => eval(item));
      }
      for (const item of jobmap) {
        const cmplink = URL.resolve(page.url, item.cmplnk);
        const { query } = URL.parse(cmplink, true);
        let name;
        if (query.q) {
          // eslint-disable-next-line prefer-destructuring
          name = query.q.split(' #')[0].split('#')[0];
        } else {
          const tmp = cmplink.match(/q-(?<text>.*)-jobs.html/u);
          if (!tmp) {
            // eslint-disable-next-line no-continue
            continue;
          }
          const { text } = tmp.groups;
          // eslint-disable-next-line prefer-destructuring
          name = text.replace(/-/g, ' ').split(' #')[0];
        }
        const surl = `https://www.indeed.com/cmp/_cs/cmpauto?q=${name}&n=10&returnlogourls=1&returncmppageurls=1&caret=8`;
        const burl = `https://www.indeed.com/viewjob?jk=${item.jk}&from=vjs&vjs=1`;
        const durl = `https://www.indeed.com/rpc/jobdescs?jks=${item.jk}`;
        tasks.push({ _id: md5(surl), type: 'suggest', host, url: surl, done: 0 });
        tasks.push({ _id: md5(burl), type: 'brief', host, url: burl, done: 0 });
        durls.push({ _id: md5(durl), type: 'detail', host, url: durl, done: 0 });
      }
      $('a[href]').each((i, ele) => {
        const tmp = URL.resolve(page.url, $(ele).attr('href'));
        const [url] = tmp.split('#');
        const { path, hostname } = URL.parse(url);
        for (const rule of rules) {
          if (rule.test(path)) {
            if (hostname == host) {
              // tasks.push({ _id: md5(url), type: 'list', host, url: decodeURI(url), done: 0 });
            }
            break;
          }
        }
      });

      let res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-search insert ${res.insertedCount} from ${tasks.length} tasks`);

      res = await global.com.task.insertMany(durls, { ordered: false }).catch(() => {});
      res && console.log(`${host}-search insert ${res.insertedCount} from ${durls.length} tasks`);

      return 1;
    } catch (err) {
      console.error(`${host}-search parse ${page.url} ${err}`);
      return 0;
    }
  },

  suggest: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const companies = [];
    try {
      const con = page.con.toString('utf-8');
      const data = json(con);
      for (const item of data) {
        const id = item.overviewUrl.replace('/cmp/', '');
        const cmpurl = `https://www.indeed.com/cmp/${id}`;
        const joburl = `https://www.indeed.com/cmp/${id}/jobs?clearPrefilter=1`;
        tasks.push({ _id: md5(cmpurl), type: 'company', host, url: cmpurl, done: 0 });
        tasks.push({ _id: md5(joburl), type: 'jobs', host, url: joburl, done: 0 });
        companies.push({ _id: id, name: item.name, url: cmpurl });
      }

      let res = await global.com.company.insertMany(companies, { ordered: false }).catch(() => {});
      res && console.log(`${host}-suggest insert ${res.insertedCount} from ${companies.length} companies`);

      res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-suggest insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-suggest parse ${page.url} ${err}`);
      return 0;
    }
  },

  // list: () => {},

  jobs: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const durls = [];
    try {
      const con = iconv.decode(page.con, 'utf-8');
      const tmp = con.match(/window._initialData=(?<text>.*);<\/script><script>window._sentryData/u);
      let data;
      if (tmp) {
        const { text } = tmp.groups;
        data = json(text);
        if (data.jobList && data.jobList.pagination && data.jobList.pagination.paginationLinks) {
          for (const item of data.jobList.pagination.paginationLinks) {
            // eslint-disable-next-line max-depth
            if (item.href) {
              item.href = item.href.replace(/\u002F/g, '/');
              const url = URL.resolve(page.url, decodeURI(item.href));
              tasks.push({ _id: md5(url), type: 'jobs', host, url: decodeURI(url), done: 0 });
            }
          }
        }
        if (data.jobList && data.jobList.jobs) {
          for (const job of data.jobList.jobs) {
            const burl = `https://www.indeed.com/viewjob?jk=${job.jobKey}&from=vjs&vjs=1`;
            const durl = `https://www.indeed.com/rpc/jobdescs?jks=${job.jobKey}`;
            tasks.push({ _id: md5(burl), type: 'brief', host, url: burl, done: 0 });
            durls.push({ _id: md5(durl), type: 'detail', host, url: durl, done: 0 });
          }
        }
      } else {
        console.log(`${host}-jobs ${page.url} has no _initialData`);
      }
      let res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-search insert ${res.insertedCount} from ${tasks.length} tasks`);

      res = await global.com.task.insertMany(durls, { ordered: false }).catch(() => {});
      res && console.log(`${host}-search insert ${res.insertedCount} from ${durls.length} tasks`);

      return 1;
    } catch (err) {
      console.error(`${host}-jobs parse ${page.url} ${err}`);
      return 0;
    }
  },

  brief: async (page) => {
    const host = URL.parse(page.url).hostname;
    try {
      const con = page.con.toString('utf-8');
      const data = json(con);
      data.done = 0;
      data.views = 0;
      data.host = host;
      // format publish date
      if (data.vfvm && data.vfvm.jobAgeRelative) {
        const str = data.vfvm.jobAgeRelative;
        const tmp = str.split(' ');
        const [first, second] = tmp;
        if (first == 'Just' || first == 'Today') {
          data.publishDate = Date.now();
        } else {
          const num = first.replace(/\+/, '');
          if (second == 'hours') {
            const date = new Date();
            const time = date.getTime();
            // eslint-disable-next-line no-mixed-operators
            date.setTime(time - num * 60 * 60 * 1000);
            data.publishDate = date.getTime();
          } else if (second == 'days') {
            const date = new Date();
            const time = date.getTime();
            // eslint-disable-next-line no-mixed-operators
            date.setTime(time - num * 24 * 60 * 60 * 1000);
            data.publishDate = date.getTime();
          } else {
            data.publishDate = Date.now();
          }
        }
      }
      await global.com.job.updateOne({ _id: data.jobKey }, { $set: data }, { upsert: true }).catch(() => { });

      const tasks = [];
      const url = `https://www.indeed.com/jobs?l=${data.jobLocationModel.jobLocation}&sort=date`;
      tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
      const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-brief insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-brief parse ${page.url} ${err}`);
      return 0;
    }
  },

  detail: async (page) => {
    const host = URL.parse(page.url).hostname;
    try {
      const con = page.con.toString('utf-8');
      const data = json(con);
      const [jobKey] = Object.keys(data);
      await global.com.job.updateOne({ _id: jobKey }, { $set: { content: data[jobKey], done: 1 } }).catch(() => { });
      return 1;
    } catch (err) {
      console.error(`${host}-detail parse ${page.url} ${err}`);
      return 0;
    }
  },

  run: (page) => {
    if (page.type == 'list') {
      page.type = 'search';
    }
    const fn = fns[page.type];
    if (fn) {
      return fn(page);
    }
    console.error(`${page.url} parser not found`);
    return 0;
  }

};

module.exports = fns;

Each analytical method will insert some new links, new links will have a record field type, by type field, you can know the analytical method of the new link, so that we can fully resolved all of the pages. For example start method inserts type and category of records for the city, for the type of city records analytical method page is citya method, a method which city will insert links to search for the type, so that has been circulating until the final brief and detail method to obtain respectively About jobs data and details.

In fact, the most critical is that these reptiles html analytical methods, with these methods, you will be able to get any structured content want.

Data Index

This part is very simple, the foregoing structure with the acquired data, according to elasticsearch, a new Schema, then the timing of the write program data is added to the index es positions inside the line. Because post details of the content a bit more, I do not have to add fields to index content inside, because too much memory, server memory is not enough,> _ <.

DEMO

Finally paste the URL for your review, the Job Search Engine .

Guess you like

Origin blog.51cto.com/14684137/2480526