Summary of data collection on Facebook and LinkedIn

Simulated login and data collection for the Facebook and LinkedIn social networking sites:

  1. Simulate login:    

          For Facebook, obtaining the user's cookie by directly simulating the login requests proved difficult or impossible in testing, so the Selenium framework is used instead to drive a browser from code and read the cookie of the logged-in account.

          For LinkedIn, the site behaves similarly, so the same method is used to obtain the user's cookie.
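
Once the browser session has produced the cookies, later requests need to carry them. The following is a minimal sketch of that step, assuming the cookies are already available as name/value pairs (for example, copied out of the set that Selenium's `driver.manage().getCookies()` returns). The class name, method names, cookie names, and URL are hypothetical and chosen only for illustration; the request is built but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns cookies captured from a logged-in browser
// session into a Cookie request header for later HTTP requests.
class CookieHeaderSketch {

    // Joins the pairs as "name=value; name=value", the Cookie header format.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Builds (but does not send) a GET request that carries the session cookies.
    static HttpRequest authenticatedGet(String url, Map<String, String> cookies) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", toCookieHeader(cookies))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder cookie names and values, not real site cookies.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session_id", "PLACEHOLDER_1");
        cookies.put("user_token", "PLACEHOLDER_2");

        HttpRequest request = authenticatedGet("https://example.com/search?q=test", cookies);
        System.out.println(request.headers().firstValue("Cookie").orElse(""));
    }
}
```

A `LinkedHashMap` is used so the cookies keep the order in which they were captured, which makes the resulting header easy to compare against what the browser sends.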

 2. Data collection:

           Facebook data collection:

            1). Task: get a user list by keyword

                 The response returned by the simulated query request is not plain HTML source, so it has to be inspected to locate the key content. Inspection shows that the key content sits inside code tags and is wrapped in an HTML comment. Jsoup does not parse commented-out markup into elements, so the comment delimiters are first removed with a search and replace, after which Jsoup can parse out the required information.

            2). Task: get the latest updates posted by a user

                  Handled the same way as the user list task.
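
The comment removal described above can be sketched with the standard library alone. This is a variation on the blanket search and replace, not the original module: it assumes the payload always appears as a single comment directly inside a code tag, and it unwraps only those comments so that ordinary comments elsewhere in the page are left alone. The class name, method name, and sample markup are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: exposes markup that is hidden inside
// <code><!-- ... --></code> so an HTML parser can see it as elements.
class CommentUnwrapSketch {

    // Group 1: opening <code ...> tag, group 2: commented-out body, group 3: </code>.
    private static final Pattern COMMENTED_CODE = Pattern.compile(
            "(<code\\b[^>]*>)\\s*<!--(.*?)-->\\s*(</code>)", Pattern.DOTALL);

    // Drops only the comment delimiters; the tags and the body are kept as they are.
    static String unwrapCodeComments(String html) {
        return COMMENTED_CODE.matcher(html).replaceAll("$1$2$3");
    }

    public static void main(String[] args) {
        // Made-up response fragment; real responses use different ids and classes.
        String response = "<p><!-- ordinary comment --></p>"
                + "<code id=\"sample\"><!-- <div class=\"name\">Example User</div> --></code>";
        System.out.println(unwrapCodeComments(response));
    }
}
```

The unwrapped string can then be handed to an HTML parser such as Jsoup in the usual way.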

           LinkedIn data collection:

            1). Task: get a user list by keyword

                  The response returned by the simulated request is HTML-like data. Inspection shows that the key content is stored inside code tags, wrapped in an HTML comment, and is in JSON form.

            2). Task: get the latest updates posted by a user

                  The data returned by the request is stored in a JavaScript-like data structure, wrapped in a comment, and is in JSON form.
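
For the LinkedIn case, the JSON text has to be cut out of the comment before any JSON parser can read it. The sketch below shows one way to do that with a regular expression, assuming each payload is a single comment directly inside a code tag. The class name, method name, and sample input are hypothetical, and the extracted strings would still need to be passed to a JSON library (the module later in this post uses `JSONObject`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the JSON text that is embedded as
// <code ...><!-- {json} --></code> in an HTML-like response.
class EmbeddedJsonSketch {

    // Group 1 captures everything between the comment delimiters.
    private static final Pattern CODE_COMMENT = Pattern.compile(
            "<code\\b[^>]*>\\s*<!--(.*?)-->\\s*</code>", Pattern.DOTALL);

    // Returns the trimmed payloads in the order they appear in the response.
    static List<String> extractPayloads(String html) {
        List<String> payloads = new ArrayList<>();
        Matcher matcher = CODE_COMMENT.matcher(html);
        while (matcher.find()) {
            payloads.add(matcher.group(1).trim());
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Made-up response fragment with a tiny JSON payload.
        String response = "<html><body>"
                + "<code id=\"sample\"><!--{\"updates\":{\"blocks\":[]}}--></code>"
                + "</body></html>";
        for (String payload : extractPayloads(response)) {
            System.out.println(payload);
        }
    }
}
```

Matching the whole `<code>` wrapper rather than searching for fixed marker strings keeps the extraction independent of the exact text that surrounds the payload.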

 

            

The key to this data collection is simulating the login to obtain the cookie, and then parsing the response.

Part of the site parsing module code is attached below:

       facebook:

       

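	// Parses a saved Facebook search result page: prints the user id, the "more" pager URL,
	// and each listed user's name, details and avatar URL.
	public void parserUserInfo() {
		String idStr="";
		
		String htmlcode=FileUtil.readString4("d:\\facebook.html");
		
		// Jsoup does not turn commented-out markup into elements, so strip the comment delimiters first.
		htmlcode=htmlcode.replace("<!--", " ");
		htmlcode=htmlcode.replace("-->", " ");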
		
		Document doc=null;
		doc = Jsoup.parse(htmlcode);
		
		
		List<Element> userIdList=new ArrayList<Element>();
		userIdList=doc.select("._s0").select("._2dpc").select("._rw").select(".img");
		if(userIdList.size()>0){
			Element userId=userIdList.get(0);
			idStr=userId.attr("id");
			System.out.println(idStr);
			idStr=idStr.substring(idStr.lastIndexOf("_")+1);
			System.out.println(idStr);
		}
		
		
		List<Element> moreList=new ArrayList<Element>();
		moreList=doc.select(".pam").select(".uiBoxLightblue").select(".uiMorePagerPrimary");
		if(moreList.size()>0){
			Element more=moreList.get(0);
			String moreStr=more.attr("href");
			moreStr="https://www.facebook.com/"+moreStr+"&"+idStr+"&__a=1";
			System.out.println(moreStr);
		}
		
		List<Element> list=doc.select("._3u1").select("._gli").select("._5und");
		System.out.println("Data size:"+list.size());
		for(int i=0;i<list.size();i++){
			Element e=list.get(i);
			List<Element> nameList=new ArrayList<Element>();
			nameList=e.select("._6a").select("._6b").select("._5d-4");
			if(nameList.size()>0){
				Element u=nameList.get(0);
				System.out.println(u.text());
				
			}
			
			List<Element> glmList=new ArrayList<Element>();
			glmList=e.select("._glm");
			if(glmList.size()>0){
				Element glm=glmList.get(0);
				System.out.println(glm.text());
			}
			
			List<Element> ajwList=new ArrayList<Element>();
			ajwList=e.select("._ajw");
			if(ajwList.size()>0){
				String myInfoStr="";
				for(int j=0;j<ajwList.size();j++){
					Element myInfo=ajwList.get(j);
					myInfoStr=myInfoStr+" "+myInfo.text();
				}
				System.out.println(myInfoStr);
			}
			
			
			List<Element> imgList=new ArrayList<Element>();
			imgList=e.select("._8o").select("._8s").select(".lfloat").select("._ohe");
			if(imgList.size()>0){
				Element img=imgList.get(0);
				List<Element> imList=new ArrayList<Element>();
				imList=img.select("img");
				if(imList.size()>0){
					Element im=imList.get(0);
					String imgSrc=im.attr("src");
					System.out.println(imgSrc);
				}
				System.out.println("");
			}
			
		}
	}

    linkedin:

    

	@Override
	public void parserUserPageInfo() {
		List<UserPublish_Linkedin> parserPublicList=new ArrayList<UserPublish_Linkedin>();
		
//		htmlcodeUserPage=FileUtil.readTxtFile("D://linkedin.txt");
//		htmlcodeUserPage=FileUtil.readString4("D://linkedin.txt");
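		int firstIndex=htmlcodeUserPage.indexOf("_nS('com.linkedin.shared.controllers.FsController')");
		int lastIndex=htmlcodeUserPage.indexOf("--></code><script id=");
		// indexOf returns -1 when a marker is missing; return early instead of letting substring throw.
		if(firstIndex<0||lastIndex<firstIndex){
			return;
		}
		htmlcodeUserPage = htmlcodeUserPage.substring(firstIndex, lastIndex);
		System.out.println(htmlcodeUserPage);
		int efirstIndex=htmlcodeUserPage.indexOf("updates");
		if(efirstIndex<0){
			return;
		}
		// Rebuild a JSON object that starts at the "updates" key.
		htmlcodeUserPage="{\""+htmlcodeUserPage.substring(efirstIndex);
		System.out.println(htmlcodeUserPage);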
		
		try {
			JSONObject  dataJson=new JSONObject(htmlcodeUserPage);
			JSONObject  updatesJson=dataJson.getJSONObject("updates");
//			JSONObject  blocksJson=updatesJson.getJSONObject("blocks");
			JSONArray blocksArray=updatesJson.getJSONArray("blocks");
			for(int i=0;i<blocksArray.length();i++){
				UserPublish_Linkedin userPublish=new UserPublish_Linkedin();
				JSONObject blocksJson=blocksArray.getJSONObject(i);
//				JSONObject mblocks=blocksJson.getJSONObject("blocks");
				JSONArray mBlocks=blocksJson.getJSONArray("blocks");
				
				//text Start
				JSONObject nBlocks=mBlocks.getJSONObject(2);
				
				JSONArray wadss=nBlocks.getJSONArray("wads");
				
				JSONObject wads=wadss.getJSONObject(0);
				
				String text=wads.get("text").toString();
				userPublish.setPublishContent(text);
				System.out.println(text);
				//text End
				//time Start
				JSONObject timeBlocks=mBlocks.getJSONObject(1);
				JSONArray timemArray=timeBlocks.getJSONArray("blocks");
				JSONObject time=timemArray.getJSONObject(1);
				String timeAgo=time.get("timeAgo").toString();
				userPublish.setPublishTime(timeAgo);
				System.out.println(timeAgo);
				//time End
				
				//nickName Start
				JSONObject nickBlocks=mBlocks.getJSONObject(0);
				String nickName=nickBlocks.get("name").toString();
				userPublish.setNickName(nickName);
				System.out.println(nickName);
				//nickName End
				
				//uId Start
				String updateId=blocksJson.getString("updateId");
				System.out.println(updateId);
				//uId End
				
				
				Date date=new Date();
				String crawlerTime=sdf.format(date);
				userPublish.setCrawlerTime(crawlerTime);
//				userPublish.setLinkedinUserId(param.getUser().getId()+"");
// userPublish.setUserUid(param.getUser().getUid());
				userPublish.setuId(updateId);
				
				parserPublicList.add(userPublish);
			}
		} catch (JSONException e) {
			e.printStackTrace ();
		}
	}

    

 

 

                 

          

     

 
