Simulated login and data collection for the Facebook and LinkedIn social networking sites:
1. Simulate login:
For Facebook, obtaining the user's cookie by directly simulating the login requests proved difficult, if not impossible, to get working in testing. The Selenium framework is used instead: the code drives a real browser and reads the cookie value of the logged-in account from it.
For LinkedIn, the site behaves similarly, so the same method is used to obtain the user's cookie.
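The cookies read from the browser session are presumably sent back with each later simulated request. A minimal sketch of that step, assuming the cookies have already been collected into a name/value map: it joins them into the value of an HTTP Cookie request header. The class name CookieHeader and the cookie names and values are hypothetical, not taken from the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CookieHeader {
    /** Joins cookie name/value pairs into the value of an HTTP Cookie request header. */
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cookies, standing in for those read from the logged-in browser session.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session_id", "abc123");
        cookies.put("user", "42");
        System.out.println(build(cookies)); // prints "session_id=abc123; user=42"
    }
}
```

The resulting string can be set as the Cookie header of whatever HTTP client issues the simulated requests.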
2. Data collection:
Facebook data collection:
1). Task: get the user list by keyword
The response returned by the simulated query request is not plain HTML page source, so it has to be inspected to locate the key content. Inspection shows that the key content sits under the code tag, wrapped in an HTML comment. Jsoup does not parse commented-out content into elements, so the comment delimiters are first removed with a search-and-replace; parsing the result with Jsoup then yields the required information.
2). Task: get the latest updates posted by a user
Handled in the same way as the user list task.
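The search-and-replace step described above can be sketched in isolation. This is a minimal illustration using only the standard library: it replaces the comment delimiters with spaces so that the enclosed markup becomes ordinary HTML that a parser such as Jsoup can then see. The class name CommentUnwrapper and the sample response fragment are hypothetical.

```java
class CommentUnwrapper {
    /** Replaces HTML comment delimiters with spaces so the enclosed markup becomes parseable. */
    static String unwrap(String html) {
        return html.replace("<!--", " ").replace("-->", " ");
    }

    public static void main(String[] args) {
        // Hypothetical fragment: the markup of interest sits inside a comment under a code tag.
        String response = "<code id=\"u_0_1\"><!--<div class=\"_gli\">Alice</div>--></code>";
        // After unwrapping, the div is regular markup instead of comment text.
        System.out.println(unwrap(response));
    }
}
```

Note that this removes every comment delimiter in the response, so any genuine comments are exposed as markup too; that is acceptable here because only specific class selectors are queried afterwards.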
LinkedIn data collection:
1). Task: get the user list by keyword
The response returned by the simulated request is HTML-like data. Inspection shows that the key content is stored under the code tag, wrapped in an HTML comment, in JSON form.
2). Task: get the latest updates posted by a user
The data returned by the request is stored in a JavaScript-like data structure, wrapped in a comment, in JSON form.
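Pulling a JSON payload out of a commented code tag can be sketched with a regular expression. This is an illustration under assumptions, not the parsing module itself: the class name EmbeddedJsonExtractor, the pattern, and the sample response are hypothetical, and the real responses may need a more specific match (for example on the id of the code tag).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedJsonExtractor {
    // Matches <code ...><!-- payload --></code>; DOTALL lets the payload span several lines.
    private static final Pattern CODE_COMMENT =
            Pattern.compile("<code[^>]*>\\s*<!--(.*?)-->\\s*</code>", Pattern.DOTALL);

    /** Returns the text of the first comment wrapped in a code tag, if there is one. */
    static Optional<String> firstPayload(String html) {
        Matcher matcher = CODE_COMMENT.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical response: JSON hidden inside a comment under a code tag.
        String response = "<html><code id=\"data\"><!--{\"updates\":{\"blocks\":[]}}--></code></html>";
        System.out.println(firstPayload(response).orElse("no payload found"));
    }
}
```

Returning an Optional makes the "marker not found" case explicit, which matters when the site changes its markup and the payload is no longer where it used to be. The extracted string can then be handed to a JSON library.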
The key to this data collection is simulating the login to obtain the cookie and parsing the response.
Part of the code of the site parsing module is attached:
Facebook:
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// FileUtil is the project's own helper class for reading a file into a string.
public void parserUserInfo() {
    String idStr = "";
    // Load the saved response and expose the commented-out markup to Jsoup.
    String htmlcode = FileUtil.readString4("d:\\facebook.html");
    htmlcode = htmlcode.replace("<!--", " ");
    htmlcode = htmlcode.replace("-->", " ");
    Document doc = Jsoup.parse(htmlcode);

    // User id: the part of the element id after the last underscore.
    List<Element> userIdList = doc.select("._s0").select("._2dpc").select("._rw").select(".img");
    if (userIdList.size() > 0) {
        Element userId = userIdList.get(0);
        idStr = userId.attr("id");
        System.out.println(idStr);
        idStr = idStr.substring(idStr.lastIndexOf("_") + 1);
        System.out.println(idStr);
    }

    // URL of the "more results" pager.
    List<Element> moreList = doc.select(".pam").select(".uiBoxLightblue").select(".uiMorePagerPrimary");
    if (moreList.size() > 0) {
        Element more = moreList.get(0);
        String moreStr = more.attr("href");
        moreStr = "https://www.facebook.com/" + moreStr + "&" + idStr + "&__a=1";
        System.out.println(moreStr);
    }

    // One element per user in the result list.
    List<Element> list = doc.select("._3u1").select("._gli").select("._5und");
    System.out.println("Data size:" + list.size());
    for (int i = 0; i < list.size(); i++) {
        Element e = list.get(i);

        // User name.
        List<Element> nameList = e.select("._6a").select("._6b").select("._5d-4");
        if (nameList.size() > 0) {
            System.out.println(nameList.get(0).text());
        }

        List<Element> glmList = e.select("._glm");
        if (glmList.size() > 0) {
            System.out.println(glmList.get(0).text());
        }

        // Profile details, joined into one line.
        List<Element> ajwList = e.select("._ajw");
        if (ajwList.size() > 0) {
            String myInfoStr = "";
            for (int j = 0; j < ajwList.size(); j++) {
                myInfoStr = myInfoStr + " " + ajwList.get(j).text();
            }
            System.out.println(myInfoStr);
        }

        // Avatar image URL.
        List<Element> imgList = e.select("._8o").select("._8s").select(".lfloat").select("._ohe");
        if (imgList.size() > 0) {
            List<Element> imList = imgList.get(0).select("img");
            if (imList.size() > 0) {
                String imgSrc = imList.get(0).attr("src");
                System.out.println(imgSrc);
            }
            System.out.println("");
        }
    }
}
LinkedIn:
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

// htmlcodeUserPage (the raw response) and sdf (a date formatter) are fields of the enclosing class.
@Override
public void parserUserPageInfo() {
    List<UserPublish_Linkedin> parserPublicList = new ArrayList<UserPublish_Linkedin>();
    // htmlcodeUserPage=FileUtil.readTxtFile("D://linkedin.txt");
    // htmlcodeUserPage=FileUtil.readString4("D://linkedin.txt");

    // Cut out the commented block that holds the JSON data.
    int firstIndex = htmlcodeUserPage.indexOf("_nS('com.linkedin.shared.controllers.FsController')");
    int lastIndex = htmlcodeUserPage.indexOf("--></code><script id=");
    if (firstIndex < 0 || lastIndex < firstIndex) {
        return; // markers not found: nothing to parse
    }
    htmlcodeUserPage = htmlcodeUserPage.substring(firstIndex, lastIndex);
    System.out.println(htmlcodeUserPage);

    // Rebuild a JSON object that starts at the "updates" key.
    int efirstIndex = htmlcodeUserPage.indexOf("updates");
    if (efirstIndex < 0) {
        return; // no "updates" key in the block
    }
    htmlcodeUserPage = "{\"" + htmlcodeUserPage.substring(efirstIndex);
    System.out.println(htmlcodeUserPage);

    try {
        JSONObject dataJson = new JSONObject(htmlcodeUserPage);
        JSONObject updatesJson = dataJson.getJSONObject("updates");
        // JSONObject blocksJson=updatesJson.getJSONObject("blocks");
        JSONArray blocksArray = updatesJson.getJSONArray("blocks");
        for (int i = 0; i < blocksArray.length(); i++) {
            UserPublish_Linkedin userPublish = new UserPublish_Linkedin();
            JSONObject blocksJson = blocksArray.getJSONObject(i);
            // JSONObject mblocks=blocksJson.getJSONObject("blocks");
            JSONArray mBlocks = blocksJson.getJSONArray("blocks");

            // Text of the update.
            JSONObject nBlocks = mBlocks.getJSONObject(2);
            JSONArray wadss = nBlocks.getJSONArray("wads");
            JSONObject wads = wadss.getJSONObject(0);
            String text = wads.get("text").toString();
            userPublish.setPublishContent(text);
            System.out.println(text);

            // Publish time, as a relative "time ago" string.
            JSONObject timeBlocks = mBlocks.getJSONObject(1);
            JSONArray timemArray = timeBlocks.getJSONArray("blocks");
            JSONObject time = timemArray.getJSONObject(1);
            String timeAgo = time.get("timeAgo").toString();
            userPublish.setPublishTime(timeAgo);
            System.out.println(timeAgo);

            // Nickname of the publisher.
            JSONObject nickBlocks = mBlocks.getJSONObject(0);
            String nickName = nickBlocks.get("name").toString();
            userPublish.setNickName(nickName);
            System.out.println(nickName);

            // Id of the update.
            String updateId = blocksJson.getString("updateId");
            System.out.println(updateId);

            Date date = new Date();
            String crawlerTime = sdf.format(date);
            userPublish.setCrawlerTime(crawlerTime);
            // userPublish.setLinkedinUserId(param.getUser().getId()+"");
            // userPublish.setUserUid(param.getUser().getUid());
            userPublish.setuId(updateId);
            parserPublicList.add(userPublish);
        }
    } catch (JSONException e) {
        e.printStackTrace();
    }
}