mirror of
				https://github.com/saitohirga/WSJT-X.git
				synced 2025-10-25 10:00:23 -04:00 
			
		
		
		
	
		
			
	
	
		
			251 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
		
		
			
		
	
	
			251 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
|  | [/ | ||
|  | Copyright 2015 Paul A. Bristow. | ||
|  | Copyright 2015 John Maddock. | ||
|  | Distributed under the Boost Software License, Version 1.0. | ||
|  | (See accompanying file LICENSE_1_0.txt or copy at | ||
|  | http://www.boost.org/LICENSE_1_0.txt). | ||
|  | ] | ||
|  | 
 | ||
|  | [section:root_comparison  Comparison of Root Finding Algorithms] | ||
|  | 
 | ||
|  | [section:cbrt_comparison  Comparison of Cube Root Finding Algorithms] | ||
|  | 
 | ||
|  | In the table below, the cube root of 28 was computed for three __fundamental_types floating-point types, | ||
|  | and one __multiprecision type __cpp_bin_float using 50 decimal digit precision, using four algorithms. | ||
|  | 
 | ||
|  | The 'exact' answer was computed using a 100 decimal digit type: | ||
|  | 
 | ||
|  |    cpp_bin_float_100 full_answer ("3.036588971875662519420809578505669635581453977248111123242141654169177268411884961770250390838097895"); | ||
|  | 
 | ||
|  | Times were measured using  __boost_timer using `class cpu_timer`. | ||
|  | 
 | ||
|  | * ['Its] is the number of iterations taken to find the root. | ||
|  | * ['Times] is the CPU time-taken in arbitrary units. | ||
|  | * ['Norm] is a normalized time, in comparison to the quickest algorithm (with value 1.00). | ||
|  | * ['Dis] is the distance from the nearest representation of the 'exact' root in bits. | ||
|  | Distance from the 'exact' answer is measured by using function __float_distance. | ||
|  | One or two bits distance means that all results are effectively 'correct'. | ||
|  | Zero means 'exact' - the nearest __representable value for the floating-point type. | ||
|  | 
 | ||
|  | The cube-root function is a simple function, and is a contrived example for root-finding. | ||
|  | It does allow us to investigate some of the factors controlling efficiency that may be extrapolated to | ||
|  | more complex functions. | ||
|  | 
 | ||
|  | The program used was [@ ../../example/root_finding_algorithms.cpp root_finding_algorithms.cpp]. | ||
|  | 100000 evaluations of each floating-point type and algorithm were used and the CPU times were | ||
|  | judged from repeat runs to have an uncertainty of 10 %.  Comparing MSVC for `double` and `long double` | ||
|  | (which are identical on this patform) may give a guide to uncertainty of timing. | ||
|  | 
 | ||
|  | The requested precision was set as follows: | ||
|  | 
 | ||
|  | [table | ||
|  | [[Function][Precision Requested]] | ||
|  | [[TOMS748][numeric_limits<T>::digits - 2]] | ||
|  | [[Newton][floor(numeric_limits<T>::digits * 0.6)]] | ||
|  | [[Halley][floor(numeric_limits<T>::digits * 0.4)]] | ||
|  | [[Schr'''ö'''der][floor(numeric_limits<T>::digits * 0.4)]] | ||
|  | ] | ||
|  | 
 | ||
|  | * The C++ Standard cube root function [@http://en.cppreference.com/w/cpp/numeric/math/cbrt std::cbrt] | ||
|  | is only defined for built-in or fundamental types, | ||
|  | so cannot be used with any User-Defined floating-point types like __multiprecision. | ||
|  | This, and that the cube function is so impeccably-behaved, | ||
|  | allows the implementer to use many tricks to achieve a fast computation. | ||
|  | On some platforms,`std::cbrt` appeared several times as quick as the more general `boost::math::cbrt`, | ||
|  | on other platforms / compiler options `boost::math::cbrt` is noticeably faster.  In general, the results are highly | ||
|  | dependent on the code-generation / processor architecture selection compiler options used.  One can | ||
|  | assume that the standard library will have been compiled with options ['nearly] optimal for the platform | ||
|  | it was installed on, where as the user has more choice over the options used for Boost.Math.  Pick something | ||
|  | too general/conservative and performance suffers, while selecting options that make use of the latest | ||
|  | instruction set opcodes speed's things up noticeably. | ||
|  | 
 | ||
|  | * Two compilers in optimise mode were compared: GCC 4.9.1 using Netbeans IDS | ||
|  | and Microsoft Visual Studio 2013 (Update 1) on the same hardware. | ||
|  | The number of iterations seemed consistent, but the relative run-times surprisingly different. | ||
|  | 
 | ||
|  | * `boost::math::cbrt` allows use with ['any user-defined floating-point type], conveniently | ||
|  | __multiprecision.  It too can take some advantage of the good-behaviour of the cube function, | ||
|  | compared to the more general implementation in the nth root-finding examples.  For example, | ||
|  | it uses a polynomial approximation to generate a better guess than dividing the exponent by three, | ||
|  | and can avoid the complex checks in __newton required to prevent the | ||
|  | search going wildly off-track.  For a known precision, it may also be possible to | ||
|  | fix the number of iterations, allowing inlining and loop unrolling.  It also | ||
|  | algebraically simplifies the Halley steps leading to a big reduction in the | ||
|  | number of floating point operations required compared to a "black box" implementation | ||
|  | that calculates the derivatives seperately and then combines them in the Halley code. | ||
|  | Typically, it was found that computation using type `double` | ||
|  | took a few times longer when using the various root-finding algorithms directly rather | ||
|  | than the hand coded/optimized `cbrt` routine. | ||
|  | 
 | ||
|  | * The importance of getting a good guess can be seen by the iteration count for the multiprecision case: | ||
|  | here we "cheat" a little and use the cube-root calculated to double precision as the initial guess. | ||
|  | The limitation of this tactic is that the range of possible (exponent) values may be less than the multiprecision type. | ||
|  | 
 | ||
|  | * For __fundamental_types, there was little to choose between the three derivative methods, | ||
|  | but for __cpp_bin_float, __newton was twice as fast.  Note that the cube-root is an extreme | ||
|  | test case as the cost of calling the functor is so cheap that the runtimes are largely | ||
|  | dominated by the complexity of the iteration code. | ||
|  | 
 | ||
|  | * Compiling with optimisation halved computation times, and any differences between algorithms | ||
|  | became nearly negligible.  The optimisation speed-up of the __TOMS748 was especially noticable. | ||
|  | 
 | ||
|  | * Using a multiprecision type like `cpp_bin_float_50` for a precision of 50 decimal digits | ||
|  | took a lot longer, as expected because most computation | ||
|  | uses software rather than 64-bit floating-point hardware. | ||
|  | Speeds are often more than 50 times slower. | ||
|  | 
 | ||
|  | * Using `cpp_bin_float_50`,  __TOMS748 was much slower showing the benefit of using derivatives. | ||
|  | __newton was found to be twice as quick as either of the second-derivative methods: | ||
|  | this is an extreme case though, the function and its derivatives are so cheap to compute that we're | ||
|  | really measuring the complexity of the boilerplate root-finding code. | ||
|  | 
 | ||
|  | * For multiprecision types only one or two extra ['iterations] are needed to get the remaining 35 digits, whatever the algorithm used. | ||
|  | (The time taken was of course much greater for these types). | ||
|  | 
 | ||
|  | * Using a 100 decimal-digit type only doubled the time and required only a very few more iterations, | ||
|  | so the cost of extra precision is mainly the underlying cost of computing more digits, | ||
|  | not in the way the algorithm works.  This confirms previous observations using __NTL high-precision types. | ||
|  | 
 | ||
|  | [include root_comparison_tables_msvc.qbk] | ||
|  | [include root_comparison_tables_gcc.qbk] | ||
|  | 
 | ||
|  | [endsect] [/section:cbrt_comparison  Comparison of Cube Root Finding Algorithms] | ||
|  | 
 | ||
|  | [section:root_n_comparison Comparison of Nth-root Finding Algorithms] | ||
|  | 
 | ||
|  | A second example compares four generalized nth-root finding algorithms for various n-th roots (5, 7 and 13) | ||
|  | of a single value 28.0, for four floating-point types, `float`, `double`, | ||
|  | `long double` and a __multiprecision type `cpp_bin_float_50`. | ||
|  | In each case the target accuracy was set using our "recomended" accuracy limits  | ||
|  | (or at least limits that make a good starting point - which is likely to give | ||
|  | close to full accuracy without resorting to unnecessary iterations). | ||
|  | 
 | ||
|  | [table | ||
|  | [[Function][Precision Requested]] | ||
|  | [[TOMS748][numeric_limits<T>::digits - 2]] | ||
|  | [[Newton][floor(numeric_limits<T>::digits * 0.6)]] | ||
|  | [[Halley][floor(numeric_limits<T>::digits * 0.4)]] | ||
|  | [[Schr'''ö'''der][floor(numeric_limits<T>::digits * 0.4)]] | ||
|  | ] | ||
|  | Tests used Microsoft Visual Studio 2013 (Update 1) and GCC 4.9.1 using source code | ||
|  | [@../../example/root_n_finding_algorithms.cpp root_n_finding_algorithms.cpp]. | ||
|  | 
 | ||
|  | The timing uncertainty (especially using MSVC) is at least 5% of normalized time 'Norm'. | ||
|  | 
 | ||
|  | To pick out the 'best' and 'worst' algorithms are highlighted in blue and red. | ||
|  | More than one result can be 'best' when normalized times are indistinguishable | ||
|  | within the uncertainty. | ||
|  | 
 | ||
|  | [/include roots_table_100_msvc.qbk] | ||
|  | [/include roots_table_75_msvc.qbk] | ||
|  | 
 | ||
|  | [/include roots_table_75_msvc_X86.qbk] | ||
|  | [/include roots_table_100_msvc_X86.qbk] | ||
|  | 
 | ||
|  | [/include roots_table_100_msvc_AVX.qbk] | ||
|  | [/include roots_table_75_msvc_AVX.qbk] | ||
|  | 
 | ||
|  | [/include roots_table_75_msvc_X86_SSE2.qbk] | ||
|  | [/include roots_table_100_msvc_X86_SSE2.qbk] | ||
|  | 
 | ||
|  | [/include roots_table_100_gcc_X64_SSE2.qbk] | ||
|  | [/include roots_table_75_gcc_X64_SSE2.qbk] | ||
|  | 
 | ||
|  | [/include type_info_table_100_msvc.qbk] | ||
|  | [/include type_info_table_75_msvc.qbk] | ||
|  | 
 | ||
|  | [include roots_table_100_msvc_X86_SSE2.qbk] | ||
|  | [include roots_table_100_msvc_X64_AVX.qbk] | ||
|  | [include roots_table_100_gcc_X64_SSE2.qbk] | ||
|  | 
 | ||
|  | Some tentative conclusions can be drawn from this limited exercise. | ||
|  | 
 | ||
|  | * Perhaps surprisingly, there is little difference between the various algorithms for __fundamental_types floating-point types. | ||
|  | Using the first derivatives (__newton) is usually the best, but while the improvement over the no-derivative | ||
|  | __TOMS748 is considerable in number of iterations, but little in execution time.  This reflects the fact that the function | ||
|  | we are finding the root for is trivial to evaluate, so runtimetimes are dominated by the time taken by the boilerplate code | ||
|  | in each method. | ||
|  | 
 | ||
|  | * The extra cost of evaluating the second derivatives (__halley or __schroder) is usually too much for any net benefit: | ||
|  | as with the cube root, these functors are so cheap to evaluate that the runtime is largely dominated by the | ||
|  | complexity of the root finding method. | ||
|  | 
 | ||
|  | * For a __multiprecision floating-point type, the __newton is a clear winner with a several-fold gain over __TOMS748, | ||
|  | and again no improvement from the second-derivative algorithms. | ||
|  | 
 | ||
|  | * The run-time of 50 decimal-digit __multiprecision is about 30-fold greater than `double`. | ||
|  | 
 | ||
|  | * The column 'dis' showing the number of bits distance from the correct result. | ||
|  | The Newton-Raphson algorithm shows a bit or two better accuracy than __TOMS748. | ||
|  | 
 | ||
|  | * The goodness of the 'guess' is especially crucial for __multiprecision. | ||
|  | Separate experiments show that evaluating the 'guess' using `double` allows | ||
|  | convergence to the final exact result in one or two iterations. | ||
|  | So in this contrived example, crudely dividing the exponent by N for a 'guess', | ||
|  | it would be far better to use a `pow<double>` or , | ||
|  | if more precise `pow<long double>`, function to estimate a 'guess'. | ||
|  | The limitation of this tactic is that the range of possible (exponent) values may be less than the multiprecision type. | ||
|  | 
 | ||
|  | * Using floating-point extension __SSE2 made a modest ten-percent speedup. | ||
|  | 
 | ||
|  | *Using MSVC, there was some improvement using 64-bit, markedly for __multiprecision. | ||
|  | 
 | ||
|  | * The GCC compiler 4.9.1 using 64-bit was at least five-folder faster that 32-bit, | ||
|  | apparently reflecting better optimization. | ||
|  | 
 | ||
|  | Clearly, your mileage [*will vary], but in summary, __newton seems the first choice of algorithm, | ||
|  | and effort to find a good 'guess' the first speed-up target, especially for __multiprecision. | ||
|  | And of course, compiler optimisation is crucial for speed. | ||
|  | 
 | ||
|  | [endsect] [/section:root_n_comparison Comparison of Nth-root Finding Algorithms] | ||
|  | 
 | ||
|  | [section:elliptic_comparison Comparison of Elliptic Integral Root Finding Algoritghms] | ||
|  | 
 | ||
|  | A second example compares four root finding algorithms for locating | ||
|  | the second radius of an ellipse with first radius 28 and arc length 300,  | ||
|  | for four floating-point types, `float`, `double`, | ||
|  | `long double` and a __multiprecision type `cpp_bin_float_50`. | ||
|  | 
 | ||
|  | Which is to say we're solving: | ||
|  | 
 | ||
|  | [pre 4xE(sqrt(1 - 28[super 2] / x[super 2])) - 300 = 0] | ||
|  | 
 | ||
|  | In each case the target accuracy was set using our "recomended" accuracy limits  | ||
|  | (or at least limits that make a good starting point - which is likely to give | ||
|  | close to full accuracy without resorting to unnecessary iterations). | ||
|  | 
 | ||
|  | [table | ||
|  | [[Function][Precision Requested]] | ||
|  | [[TOMS748][numeric_limits<T>::digits - 2]] | ||
|  | [[Newton][floor(numeric_limits<T>::digits * 0.6)]] | ||
|  | [[Halley][floor(numeric_limits<T>::digits * 0.4)]] | ||
|  | [[Schr'''ö'''der][floor(numeric_limits<T>::digits * 0.4)]] | ||
|  | ] | ||
|  | Tests used Microsoft Visual Studio 2013 (Update 1) and GCC 4.9.1 using source code | ||
|  | [@../../example/root_elliptic_finding.cpp root_elliptic_finding.cpp]. | ||
|  | 
 | ||
|  | The timing uncertainty (especially using MSVC) is at least 5% of normalized time 'Norm'. | ||
|  | 
 | ||
|  | To pick out the 'best' and 'worst' algorithms are highlighted in blue and red. | ||
|  | More than one result can be 'best' when normalized times are indistinguishable | ||
|  | within the uncertainty. | ||
|  | 
 | ||
|  | [include elliptic_table_100_msvc_X86_SSE2.qbk] | ||
|  | [include elliptic_table_100_msvc_X64_AVX.qbk] | ||
|  | [include elliptic_table_100_gcc_X64_SSE2.qbk] | ||
|  | 
 | ||
|  | Remarks: | ||
|  | 
 | ||
|  | * The function being solved is now moderately expensive to call, and twice as expensive to call | ||
|  | when obtaining the derivative than when not.  Consequently there is very little improvement in moving | ||
|  | from a derivative free method, to Newton iteration.  However, once you've calculated the first derivative | ||
|  | the second comes almost for free, consequently the third order methods (Halley) does much the best. | ||
|  | * Of the two second order methods, Halley does best as would be expected: the Schroder method offers better | ||
|  | guarantees of ['quadratic] convergence, while Halley relies on a smooth function with a single root to | ||
|  | give ['cubic] convergence.  It's not entirely clear why Schroder iteration often does worse than Newton. | ||
|  | 
 | ||
|  | [endsect][/section:elliptic_comparison Comparison of Elliptic Integral Root Finding Algoritghms] | ||
|  | 
 | ||
|  | [endsect] [/section:root_comparison  Comparison of Root Finding Algorithms] | ||
|  | 
 |