Get results faster with Java multithreading

Posted on June 25th, 2011
Previous Article :: Next Article

Code Parallelization

Code parallelization is the process of modifying a simulation code to make it run faster by splitting the workload among multiple computers (well, in the very general sense). There are many options available. MPI is a widely used networked protocol that allows programs running on different computers to communicate with each other. It is deigned for use on clusters, arrays of hundreds and even thousands of individual computers linked together by network cables. OpenMP uses a different model. It is a set of compiler constructs that allow incorporation of multithreading into C and Fortran codes. As such, it is suitable for fine-grade parallelization on machines containing multiple cores sharing the same memory (such as most modern PCs). A relative newcomer to the game is Cuda. Cuda is a special interface language that allows you to write code that runs not on the CPU, but on a compatible NVIDIA GPU (graphics card). GPUs are basically highly optimized vector computers. They can execute single instruction on multiple data concurrently. In the world of particle plasma simulations, this means you can for instance push hundreds of particles at the same time.


These three methods allow you to write your program in a specific way so that it can run faster. Usually the serial (single processor) version is written first, and after it is shown to work for a smaller domain set, the program is parallelized, and eventually used to run simulation on a large domain. Parallelization of a serial code is a nontrivial task. It requires significant code changes and time devoted to debugging. However, sometimes parallelization is not actually necessary. Often, we need to run a single program multiple times to analyze the dependence of results on some input parameter. Instead of making each case execute faster, we can obtain the final set in shorter time by simply running the multiple cases concurrently. If the number of cases is small, this can be done by simply building several versions of the executable and launching them individually. However, if the number of required cases is larger than the number of CPUs, such an approach will result in non-optimal performance. Often, we are also interested in doing some post-processing of the data – each simulation may correspond to just a single point on an XY plot. We prefer for the main program to parse these, and output a single file containing results from all the simulations.

We can hence utilize multithreading by launching each case as a separate thread. A simple scheduler will make sure than only as many threads are running at one time as there are CPU cores. And while OpenMP is required to add multithreading into C codes, Java supports multithreading natively. If your code already happens to be written in Java, it becomes very simple to add multithreading support.

It should be noted that true parallelization offers some benefits over the simple method presented here. For instance, by distributing the problem to multiple CPUs we can simulate domains larger than could fit in the memory of a single machine. The method presented here results in each case running serially and hence memory constraints still apply. This method is best suited for codes that do not require large amounts of memory, and also take comparable computational times for the inputs tested.


We will demonstrate this method using a simple example: computing pi.The task at hand is: compute pi by sampling some specified number of points and compute the error. Simple way to calculate pi is to pick random points in [0,1]x[0,1] domain. Next imagine that a circle of radius 1 is drawn at the origin. One quarter of the circle will be located in this 1×1 domain. The ratio of points lying inside the circle to the total number of points is directly related to the ratio of the area of the quartercicle to the square. In other words, [latex]displaystyle n_c=frac{pi}{4}n_0[/latex]. We can determine if the point is inside or outside the circle from sqrt(x*x+y*y)<1. Due to the statistical number of computations, the computed error will vary from run to run. To get a more accurate result, we need to repeat the simulation multiple times and average the result. The code below illustrates how this task would be handled serially. The main function consists of a loop that executes the simulation 24 times. After each simulation finishes, the code adds the resulting error to a running sum. Once the loop completes, the average error is shown on the screen. Of course, your program would do something more useful than calculate pi. For instance, the calls to calculatePI() could be calls to execute the main loop in a particle in cell plasma simulation for a different range of input plasma parameters…

  1. /*------ -----------*/
  2. public class Main 
  3. {
  5. 	public static void main(String[] args) 
  6. 	{
  7. 		double sum = 0;
  8. 		int ns=24;   /*number of computations*/
  10.                 /*sample 50,000 x 1,000 points*/
  11. 		Simulation sim = new Simulation(50000);
  13. 		for (int i=0;i<ns;i++)
  14. 		{
  15. 			sim.computePI();
  16. 			sum += sim.getError();
  17. 		}
  19. 		System.out.printf("Average error is %gn", sum/ns);
  20. 	}
  21. }
  23. /*------ -----------*/
  24. public class Simulation
  25. {
  26. 	protected double result;			/*result*/
  27. 	protected double error;				/*error in percent*/
  29. 	protected int nk;
  31. 	double getResult() {return result;}
  32. 	double getError() {return error;}
  34. 	Simulation (int nk)
  35. 	{
  36. 		this.nk = nk;
  37. 	}
  39. 	/*CalculatesPI by taking nk*1000 sample points*/
  40. 	void computePI()
  41. 	{
  42. 		double x,y;
  43. 		double r;
  44. 		int i,j=0;
  45. 		int count=0;
  47. 		for (i=0;i<nk;i++)
  48. 			for (j=0;j<1000;j++)
  49. 			{
  50. 				/*select random point*/
  51. 				x = Math.random();
  52. 				y = Math.random();
  54. 				r=Math.sqrt(x*x+y*y);
  55. 				if (r<=1.0)
  56. 					count++;
  57. 			}
  59. 		result = 4*count/(double)(nk*j);
  60. 		error = 100*Math.abs(result-Math.PI)/Math.PI;
  61. 		System.out.printf(" nk = %d:t pi = %g,t error = %.2g%%n", nk,result,error);
  62. 	}
  63. }

Parallel version

Our goal is to execute each computation as an individual thread. This requires few small changes to our class Simulation. First, the class needs to extend the base class Thread. The class must implement a routine public void run(), which will be executed when the thread is started by calling start(). This routine then in turns executes the computation. Finally we modify the constructor to take in few extra parameters: name and ThreadGroup. These are passed over to the base class. Java lets you define multiple thread groups to better organize the computation. The name argument can be used to uniquely identify the thread. Although we give each thread a unique name, this is not necessary. The name is also not being used by our example.

  1. /*------ -----------*/
  2. class Simulation extends Thread
  3. {
  4.     protected double result;			/*result*/
  5.     protected double error;			/*error in percent*/
  7.     protected int nk;
  9.     double getResult() {return result;}
  10.     double getError() {return error;}
  12.     Simulation (int nk, String name, ThreadGroup tg)
  13.     {
  14.         super(tg,name);
  15. 	this.nk = nk;
  16.     }
  18.     public void run()
  19.     {
  20. 	computePI();
  21.     }
  23.     /*CalculatesPI by taking nk*1000 sample points*/
  24.     void computePI()
  25.     {
  26.            /* same code as above */
  27.     }
  28. }

We next need to modify the main loop. Since the simulations run concurrently, we need to instantiate each case as a new instance. The instances are stored in an array list. We create only one thread group, and associate each instance with it. We also determine how many processors Java has access to. We next incorporate a simple scheduler. The scheduler checks if the number of currently active threads is less than the number of available cores. If so, it launches another simulation from the list of cases to run. Otherwise if there are no free cores, it goes to sleep for 0.1 seconds. This sleep command is necessary to prevent overloading the CPU with calls to activeCount. The loop continues until all cases have been launched. After this, we wait for all simulations to finish running using a similar logic. We then loop through all our instances and calculate the running sum. The sum is outputted, along with the simulation time.

  1. /*------ -----------*/
  2. import java.util.*;
  4. public class Main 
  5. {
  6. 	public static void main(String[] args) 
  7. 	{
  8. 		ThreadGroup tg = new ThreadGroup("main");
  9. 		int np = Runtime.getRuntime().availableProcessors();
  10. 		int i, ns=24;
  12. 		List<Simulation> sims = new ArrayList<Simulation>();
  14. 		long start = System.nanoTime();
  16. 		for (i=0;i<ns;i++)
  17. 			sims.add(new Simulation(50000,"PI"+i, tg));
  19. 		i=0;
  20. 		while (i<sims.size())
  21. 		{
  22. 			/*do we have available CPUs?*/
  23. 			if (tg.activeCount()<np)
  24. 			{
  25. 				Simulation sim = sims.get(i);
  26. 				sim.start();
  27. 				i++;
  28. 			} else
  29. 				try {Thread.sleep(100);} /*wait 0.1 second before checking again*/
  30. 					catch (InterruptedException e) {e.printStackTrace();}
  31. 		}
  34. 		/*wait for threads to finish*/
  35. 		while(tg.activeCount()>0)
  36. 		{
  37. 			try {Thread.sleep(100);} 
  38. 				catch (InterruptedException e) {e.printStackTrace();}
  39. 		}
  41. 		/*sum up errors*/
  42. 		double sum=0;
  43. 		for (i=0;i<sims.size();i++)
  44. 		{
  45. 			Simulation sim = sims.get(i);
  46. 			sum+=sim.getError();
  47. 		}
  49. 		long end  = System.nanoTime();
  51. 		System.out.printf("Average error is %gn", sum/sims.size());
  52. 		System.out.printf("Simulation took %.2g secondsn", (double)(end-start)/1e9);
  53. 	}
  54. }

Performance checking

That’s it, or is it? Whenever parallelizing code, you should always include benchmarking to make sure that the parallelized version indeed works as expected. Case in point. The serial version takes 75 seconds to complete on my machine. The version utilizing 8 CPU cores takes 250 seconds! And this is despite the Windows Task Manager showing 100% utilization of the CPUs. Not only did increasing the machine power by a factor of 8 not speed up the code, it in fact slowed it down by over 3 times. What’s going on?

Java documentation states “This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.” In other words, the built-in random function is designed such that only one tread can access it at a time. Despite launching multiple threads, only one is actually executing at any given time. The others are waiting for the other thread to return from the call to random. To correct this issue, we need each thread to create it’s own instance of the random number generator, Random rnd = new java.util.Random();. We then sample random numbers using rnd.nextDouble();. The updated code is shown below.

  1. /*------ -----------*/
  2. import java.util.Random;
  4. public class Simulation extends Thread
  5. {
  6. 	protected double result;			/*result*/
  7. 	protected double error;				/*error in percent*/
  9. 	protected int nk;
  10. 	Random rnd;
  12. 	double getResult() {return result;}
  13. 	double getError() {return error;}
  15. 	Simulation (int nk, String name, ThreadGroup tg)
  16. 	{
  17. 		super(tg,name);
  18. 		this.nk = nk;
  19. 		rnd = new java.util.Random();
  20. 	}
  22. 	public void run()
  23. 	{
  24. 		computePI();
  25. 	}
  27. 	/*CalculatesPI by taking nk*1000 sample points*/
  28. 	void computePI()
  29. 	{
  30. 		double x,y;
  31. 		double r;
  32. 		int i,j=0;
  33. 		int count=0;
  35. 		for (i=0;i<nk;i++)
  36. 			for (j=0;j<1000;j++)
  37. 			{
  38. 				/*select random point*/
  39. 				x = rnd.nextDouble();
  40. 				y = rnd.nextDouble();
  42. 				r=Math.sqrt(x*x+y*y);
  43. 				if (r<=1.0)
  44. 					count++;
  45. 			}
  47. 		result = 4*count/(double)(nk*j);
  48. 		error = 100*Math.abs(result-Math.PI)/Math.PI;
  49. 		System.out.printf(" nk = %d:t pi = %g,t error = %.2g%%n", nk,result,error);
  50. 	}
  51. }

This change reduced the computational time on 8 cores from 250 seconds to 16. Quite impressive! It also means that the serial version that took 75 seconds was sped up by a factor of 4.7 by employing 8 cores. This is a far stretch from the expected 8x speed up, however ratios like these are much more realistic. And that’s it. Feel free to leave a comment if you have any questions or if something isn’t clear.

11 comments to “Get results faster with Java multithreading”

  1. Shaunak
    June 26, 2011 at 12:10 pm


    I just read the article on Java Multithreading. You did a real good job with this one. It is very easy to understand.

  2. John
    November 9, 2011 at 9:08 am

    Here is the Java Multithreading example implemented as an applet in a webpage.

    It also makes use of the livescript capabilities to communicate between Java in the applet and Javascript in the webpage.

  3. Henrique
    March 14, 2012 at 5:37 pm


    Firstly, I want to congratulate your good work and explanation of multithreading in Java.

    I’m doing a parallel version of a algorithm and I have the same problem! I removed the synchronized methods implemented by me, but, I think there are something that slowdown my parallel application.

    Can you tell me another tips that can slowdown the speed of my application?

    Thank you very much!

    University Technologic Federal of Parana
    Enginnering Computing

    • March 18, 2012 at 7:03 am

      Hi Henrique, it’s kind of hard to answer your question without seeing the details of your code. Not sure what development environment you use, but NetBeans includes a profiler you can use to check the speed of your code which could give you some hints.

  4. Henrique
    April 25, 2012 at 5:47 am

    Hi Lubos,

    I’m using NetBeans.
    To test my application I use the NetBeans profiler. But, I only see the CPU/MEM usage.

    I tracked the function with more CPU usage and I found the “heavy function”.

    I parallelized this function and, I have the same result that you got. My 4 cores are in 100% of load of each core. But, I get the same time of sequential code! (like your first situation)

    After, I removed the synchronous functions and static methods, but the time continue poor!

    Can you tell me if NetBeans profiler has a feature that I can use to track it? If you can contact by e-mail, please, ok?


  5. September 26, 2016 at 9:48 am

    Very nice! Right to the point!
    When I tested the first parallel version, I thought: this is much slower. hehe xD

  6. waimar
    October 8, 2017 at 6:50 am

    I want to get high speed performance for my mathematical computation.So,I will use Thread in java.But when I run the simple code about thread in java from google,the execution time is more longer than the simple program.How I can solve it?
    Help me.

  7. Hannah Tejada
    March 5, 2018 at 3:40 pm

    If you’re every scholar looking to spend much less on the protection,
    have a take a look at citizen school student discounts .

    The top rated ( ) often signifies that the side
    is in the route of be up-to-date via the really worth upon the particularly.

  8. Ryan
    June 25, 2020 at 11:40 am

    I recommend not using java.util.Random over ThreadLocalRandom, not only is ThreadLocalRandom more readable, it is also perfect for this scenario: random numbers in parallel in thread pools (as stated by the javadocs), this also has much less overhead.

    • June 25, 2020 at 2:45 pm

      Thanks Ryan! I am not even familiar with this class. It may not have even existed when this article was written (using Java 6 I believe).

Leave a Reply

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre lang="" line="" escaped="" cssfile=""> In addition, you can use \( ...\) to include equations.