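1. Synchronous execution
Fetch the URLs one by one in a for loop. This is the simplest approach but also the least efficient: a URL whose site responds slowly blocks every request queued behind it.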
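The sequential approach described above could be sketched roughly as follows. The class name SequentialFetcher, the methods fetch and fetchAll, and the latencyMillis parameter are all hypothetical; fetch only sleeps to simulate a slow site rather than issuing a real HTTP request.

```java
import java.util.ArrayList;
import java.util.List;

class SequentialFetcher {

    // Hypothetical stand-in for a real HTTP request: sleeps to simulate latency.
    static String fetch(String url, long latencyMillis) {
        try {
            Thread.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return "ok:" + url;
    }

    // Fetches the URLs one after another, so the total time is roughly
    // the sum of the individual latencies.
    static List<String> fetchAll(String[] urls, long latencyMillis) {
        List<String> results = new ArrayList<>();
        for (String url : urls) {
            results.add(fetch(url, latencyMillis));
        }
        return results;
    }

    public static void main(String[] args) {
        String[] urls = { "url1", "url2", "url3" };
        long start = System.nanoTime();
        List<String> results = fetchAll(urls, 200);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results + " in about " + elapsedMillis + " ms");
    }
}
```

With three URLs at a simulated 200 ms each, the loop needs at least 600 ms, which is the cost the following sections try to avoid.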
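2. Asynchronous approach
Start one thread per URL:
String[] urls = { "url1", "url2", "url3", "url4", "url5", "url6", "url7" };
for (final String url : urls) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            // fetch site
        }
    });
    // Start the thread inside the loop so that every URL gets its own running thread.
    t.start();
}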
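This approach uses multiple threads to issue the HTTP requests concurrently, which maximizes throughput. It has problems of its own, though:
1. One thread is started per image, so the thread count is unbounded. Ten thousand images means ten thousand threads, which is clearly more than the machine's resources can sensibly support.
2. Starting a large number of threads carries a performance cost of its own.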
3. Using a thread pool
Configuring a thread pool keeps resource usage under control.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThreadPool {
    private final List<Thread> threads = new ArrayList<Thread>();
    // Bounded task queue shared by all workers.
    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(20);

    public ThreadPool(int size) {
        for (int i = 0; i < size; ++i) {
            Thread thread = new Thread(new Worker(queue));
            // thread.setDaemon(true);
            thread.start();
            threads.add(thread);
        }
    }

    // put() blocks while the queue is full; add() would throw IllegalStateException instead.
    public void submit(Runnable runnable) throws InterruptedException {
        queue.put(runnable);
    }

    private static class Worker implements Runnable {
        private final BlockingQueue<Runnable> queue;

        public Worker(BlockingQueue<Runnable> queue) {
            this.queue = queue;
        }

        @Override
        public void run() {
            while (true) {
                try {
                    // take() blocks until a task arrives, so no polling or sleeping is needed.
                    queue.take().run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }
}

public class ThreadPoolFetcher {
    public static void main(String[] args) throws InterruptedException {
        ThreadPool pool = new ThreadPool(7);
        String[] urls = { "url1", "url2", "url3", "url4", "url5", "url6", "url7" };
        for (String url : urls) {
            pool.submit(new Fetcher(url));
        }
    }

    private static class Fetcher implements Runnable {
        private final String url;

        public Fetcher(String url) {
            this.url = url;
        }

        public void run() {
            System.out.println(Thread.currentThread().getName() + ":" + url);
            try {
                Thread.sleep(1000); // simulates the time spent fetching
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
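This version implements the thread pool with the producer-consumer pattern, which keeps resource usage under control.
One problem remains: the threads in the pool do not exit after all the tasks have finished.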
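One common way to let the worker threads exit is a sentinel task, often called a poison pill: when a worker takes it from the queue, it leaves its loop. The sketch below is only an illustration under stated assumptions, not the pool shown above. The names StoppablePool, STOP, and shutdownAndWait are hypothetical, and an unbounded LinkedBlockingQueue is assumed to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class StoppablePool {
    // Hypothetical sentinel ("poison pill"): a worker that takes it stops.
    private static final Runnable STOP = () -> { };

    private final List<Thread> threads = new ArrayList<>();
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    StoppablePool(int size) {
        for (int i = 0; i < size; i++) {
            Thread t = new Thread(this::workLoop);
            t.start();
            threads.add(t);
        }
    }

    void submit(Runnable task) {
        queue.add(task); // the queue is unbounded, so add() does not fail on capacity
    }

    // Queues one pill per worker behind the pending tasks, then waits for the workers.
    void shutdownAndWait() {
        for (int i = 0; i < threads.size(); i++) {
            queue.add(STOP);
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void workLoop() {
        try {
            while (true) {
                Runnable task = queue.take(); // blocks until a task or a pill arrives
                if (task == STOP) {
                    return;
                }
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StoppablePool pool = new StoppablePool(3);
        AtomicInteger done = new AtomicInteger();
        for (int i = 1; i <= 7; i++) {
            String url = "url" + i;
            pool.submit(() -> {
                System.out.println(Thread.currentThread().getName() + ":" + url);
                done.incrementAndGet();
            });
        }
        pool.shutdownAndWait();
        System.out.println("finished " + done.get() + " tasks");
    }
}
```

Because the pills are queued after the real tasks, every task submitted before shutdownAndWait still runs. A task that throws would end its worker early, so a more careful version would catch exceptions around task.run().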
4. Using Executors
In the end, the easiest option is the Executors class that the JDK has provided since Java 5:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Fetcher {
    public static void main(String[] args) throws InterruptedException {
        String[] urls = { "url1", "url2", "url3", "url4", "url5", "url6", "url7" };
        ExecutorService exs = Executors.newFixedThreadPool(100);
        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (String url : urls) {
            tasks.add(new FetchImageTask(url));
        }
        // invokeAll blocks until every task has completed.
        exs.invokeAll(tasks);
        System.out.println("end");
        // shutdown lets the pool's threads exit so the JVM can terminate.
        exs.shutdown();
    }

    /**
     * @author yunpeng
     */
    private static class FetchImageTask implements Callable<String> {
        private final String url;

        public FetchImageTask(String url) {
            this.url = url;
        }

        @Override
        public String call() throws Exception {
            System.out.println(Thread.currentThread().getName());
            Thread.sleep(3000); // simulates the time spent fetching
            return "ok" + url;
        }
    }
}
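The Executors example above discards the list of Future objects that invokeAll returns. A minimal sketch of reading the results back might look like the following. The class ResultCollector, the method fetchAll, and the poolSize parameter are hypothetical, and the task is a placeholder that returns "ok" plus the URL, as in the example above, instead of performing a real fetch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ResultCollector {

    // Runs one placeholder task per URL and returns the results in submission order.
    static List<String> fetchAll(List<String> urls, int poolSize) {
        ExecutorService exs = Executors.newFixedThreadPool(poolSize);
        List<String> results = new ArrayList<>();
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> "ok" + url); // placeholder for a real fetch
            }
            // invokeAll returns the futures in the same order as the task list.
            for (Future<String> future : exs.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // The task threw; record the cause instead of failing the whole batch.
                    results.add("failed: " + e.getCause());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            exs.shutdown(); // always release the pool's threads
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(Arrays.asList("url1", "url2", "url3"), 2));
    }
}
```

Placing shutdown in a finally block means the pool's threads are released even when a task fails or the calling thread is interrupted.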