Downloading attachments/images from a web server to local storage with Java

I'm a newcomer, recording some problems I ran into at work and how I solved them. This is mainly for my own future reference; guidance and discussion from experts are welcome.

I have only just started writing a blog, so please point out any shortcomings kindly.

First, some background

When using Java to fetch content from website pages (crawling pages), you may run into the following two situations (the only two I have encountered so far).

1. The Java backend can download resources directly over the external network

I won't discuss this much: just fetch the page as it is, use DOM operations in the Java backend to batch-rewrite the href attribute of a tags and the src attribute of img tags to absolute paths, and the resources can then be downloaded or accessed directly.
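
The rewrite to absolute paths could look roughly like the sketch below. It is only an illustration: the class name LinkAbsolutizer is made up, and a simple regular expression that handles double-quoted attribute values stands in for the DOM operations mentioned above, so real pages would likely need a proper HTML parser.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class LinkAbsolutizer {
    // Matches href="..." or src="..." with double-quoted values only (a simplification).
    private static final Pattern LINK_ATTR =
            Pattern.compile("(href|src)=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Rewrites every double-quoted href/src value in html to an absolute URL. */
    static String absolutize(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher m = LINK_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String resolved;
            try {
                // Resolve the link against the address of the page it came from
                resolved = base.resolve(m.group(2)).toString();
            } catch (IllegalArgumentException e) {
                resolved = m.group(2); // leave values that are not valid URIs untouched
            }
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + "=\"" + resolved + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling LinkAbsolutizer.absolutize(html, pageUrl) with the address of the crawled page resolves relative links such as ../files/a.doc against that page; links that are already absolute are left as they are.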

2. The Java backend cannot download resources directly over the external network

There are two cases here:
The first: the website can be accessed and its files downloaded normally in a browser, but attachments/images cannot be downloaded when the request comes from the Java backend code. The reason is that the website applies anti-leech (hotlink) protection to that content.
The second: intranet resource websites that cannot be accessed directly from a browser and must be reached through a VPN or a proxy server.
These two cases are the focus of this article, and both can be solved in the same way, as follows.
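
For the first case, one common way anti-leech protection works is to check request headers such as Referer and Cookie. The article does not say which check the site uses, so the sketch below only assumes that sending the same headers a browser would send is enough. The class name HeaderForwarding and the header values are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

// Hypothetical helper, not part of the original project.
class HeaderForwarding {
    /** Prepares (without connecting yet) a GET request that carries the given headers. */
    static HttpURLConnection open(String fileUrl, Map<String, String> headers) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(fileUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5 * 1000); // 5 seconds to establish the connection
        conn.setReadTimeout(5 * 1000);    // 5 seconds of silence allowed while reading
        for (Map.Entry<String, String> header : headers.entrySet()) {
            // e.g. "Referer" -> address of the page that links to the file
            conn.setRequestProperty(header.getKey(), header.getValue());
        }
        return conn;
    }
}
```

The connection is only opened for real when getInputStream() is called, so the recorded headers can be attached first.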

Actual business requirement

The general setup is as follows. There are two servers. Server A belongs to one organization and can only be accessed from its intranet. Server B belongs to another organization and can be accessed directly from the external network. The two organizations now want to cooperate, and the requirement is that users can access a specified website on server A through server B (the website has login verification).
Note: B can access A, but A needs to add B's network address to its whitelist.
When a user logs in to A's website through B, the backend has to record the verification parameters such as cookies and headers, and add them to the request headers sent to A's server. Once verification passes, the website can be accessed normally. However, the href link of an a tag and the src link of an img tag cannot carry these parameters, so those requests fail: images cannot be displayed and attachments cannot be downloaded.
Solution: first download the content that cannot be fetched directly to server B, then replace its links in the page with links to the copies on B. When the user downloads a file, it is then served directly from B.
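
The link replacement step could be sketched as follows. This is only an illustration under assumptions: the cached file is named after the MD5 of its original address (which is what the project's Md5Util appears to do in the code further down, though its exact output format is not shown), and the names LocalLinks, replaceLink and cacheBaseUrl are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the original project.
class LocalLinks {
    /** Hex-encoded MD5 of the text, used here only to derive a stable file name. */
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Replaces one remote link in the page with the link to its cached copy on B. */
    static String replaceLink(String html, String remoteUrl, String suffix, String cacheBaseUrl) {
        String localUrl = cacheBaseUrl + "/" + md5Hex(remoteUrl) + "." + suffix;
        return html.replace(remoteUrl, localUrl);
    }
}
```

Because the file name is derived from the original address, the same remote file always maps to the same cached copy on B.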

Implementation code

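// Imports needed by the two methods below (Md5Util is this project's own utility class)
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import javax.servlet.http.HttpServletRequest;
import org.apache.struts2.ServletActionContext;

/**
 * Save the attachment locally and return the local file name
 * @param fileUrl network address of the attachment
 * @param suffix file extension of the attachment
 * @return local file name of the attachment
 */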
public String getLocalFileUrl(String fileUrl, String suffix ){
	String localFileUrl = null;
	String fileName = null;
	try {
		Md5Util md5 = Md5Util.getInstance();
		fileName = md5.string2string(fileUrl)+ "." +suffix;
			
		URL url = new URL(fileUrl);
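		// Set the proxy (placeholders: fill in the intranet proxy's address and port)
		String proxyHost = "intranet proxy address";
		int proxyPort = 0;
		Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));
		// Open the connection through the proxy
		HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
		// Add the verification headers. Basic proxy authentication expects "Basic "
		// followed by the Base64 encoding of "username:password" (java.util.Base64 needs Java 8+).
		String proxyCredentials = java.util.Base64.getEncoder()
				.encodeToString("Username:Password".getBytes(java.nio.charset.StandardCharsets.UTF_8));
		conn.setRequestProperty("Proxy-Authorization", "Basic " + proxyCredentials);
		// Note: HttpURLConnection ignores a manually set Host header unless the JVM
		// is started with -Dsun.net.http.allowRestrictedHeaders=true
		conn.setRequestProperty("Host", "website address");
		conn.setRequestProperty("Cookie", "XXXXXXXXX");
		// Set the request method to GET
		conn.setRequestMethod("GET");
		// Connection timeout: 5 seconds
		conn.setConnectTimeout(5 * 1000);
		// Get the response input stream
		InputStream inStream = conn.getInputStream();
		// Read the stream as raw bytes, which works for any file type
		byte[] data = readInputStream(inStream);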
		// Get the current request in order to locate the web application's directory
		HttpServletRequest request = ServletActionContext.getRequest();
		// Get the web application's root directory on the server
		String directoryURL = request.getSession().getServletContext().getRealPath("/");   // e.g. D:\workspace\.metadata\.me_tcat7\webapps\EducationalSystem\
		// String directoryURL = "/home/........"; // In the production Linux environment I use an absolute path here
		// Create the local storage directory
		File fileMkdir = new File(directoryURL + "/cacheFile/" + supportSchoolId());
		if (!fileMkdir.exists()) {
			fileMkdir.mkdirs();
		}

		localFileUrl = fileMkdir + "/" + fileName; // full path of the local file
		File localFile = new File(localFileUrl);
		// Write the data; FileOutputStream creates the file if it does not exist
		FileOutputStream outStream = new FileOutputStream(localFile);
		try {
			outStream.write(data);
		} finally {
			// Always close the output stream
			outStream.close();
		}
	} catch (Exception e) {
		e.printStackTrace ();
	}
	return fileName;
}
	
public byte[] readInputStream(InputStream inStream) throws Exception {
	ByteArrayOutputStream outStream = new ByteArrayOutputStream();
	// Create a byte buffer
	byte[] buffer = new byte[1024];
	// Number of bytes read each time; -1 means the end of the stream was reached
	int len = 0;
	try {
		// Read from the input stream into the buffer
		while ((len = inStream.read(buffer)) != -1) {
			// Write the bytes just read to the in-memory output stream:
			// 0 is the offset in the buffer, len is the number of bytes to write
			outStream.write(buffer, 0, len);
		}
	} finally {
		// Close the input stream even if reading fails
		inStream.close();
	}
	// Return everything that was read as a byte array
	return outStream.toByteArray();
}
With the code above, the data on A is downloaded and saved to B. I found the base of this code after a long search online, and then added what my actual requirements called for, such as the proxy and the request headers.
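
As a possible alternative to reading the whole file into a byte array first, the stream can be copied straight to disk. The sketch below is not the code I used; it assumes Java 7 or later, and the class name StreamSaver is made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the original project.
class StreamSaver {
    /** Copies the stream to target, creating parent directories; returns the bytes written. */
    static long save(InputStream in, Path target) throws IOException {
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no-op when the directories already exist
        }
        try (InputStream source = in) { // the stream is closed even if the copy fails
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

With this approach large attachments never have to fit in memory, and a failure to create the directory or file surfaces as an exception instead of passing silently.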

Finally, I would like to talk about the pitfalls I ran into along the way.

1. The file is downloaded to the local server, and when the user clicks the link, the download link points at the local server through the href attribute of the a tag. However, clicking it jumped to a blank page instead of downloading directly. Strangely, the file did download when I refreshed the blank page or copied its address into a new tab.
Solution: open the link in a new tab by setting the a tag's target attribute to _blank.
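
Since the backend rewrites the links anyway, the target attribute can be added at the same time. A minimal sketch, with the class name DownloadLink and its methods made up for the example:

```java
// Hypothetical helper, not part of the original project.
class DownloadLink {
    /** Escapes the characters that are special inside HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds an a tag that opens the cached file in a new tab. */
    static String anchor(String localUrl, String text) {
        return "<a href=\"" + escape(localUrl) + "\" target=\"_blank\">" + escape(text) + "</a>";
    }
}
```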

2. After running and debugging this function in Eclipse on my own laptop without errors, I deployed it to the Linux server, but files could not be downloaded and the error was a 404. After several setbacks, I realized that the proxy account provided by A had no permission to create folders and files, so file creation failed. Naturally the file could not be found afterwards, hence the 404.
Solution: connect to the Linux server with a remote tool such as Xshell, go to the specified directory, change its permissions to 777 (sudo chmod 777 specified directory), and then save the files into that directory.
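
This kind of permission problem could be caught earlier by checking the cache directory before saving, so that a clear error appears in the log instead of a 404 later. A minimal sketch, assuming Java 7 or later; the class name CacheDirCheck is made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original project.
class CacheDirCheck {
    /** Creates dir if needed and fails loudly when the process cannot write to it. */
    static Path ensureWritable(Path dir) throws IOException {
        Files.createDirectories(dir); // throws if a directory cannot be created
        if (!Files.isWritable(dir)) {
            throw new IOException("Cache directory is not writable: " + dir);
        }
        return dir;
    }
}
```

Unlike File.mkdirs(), which only returns false on failure, Files.createDirectories throws an exception that says what went wrong.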
