faqts : Computers : Programming : Languages : Python : Common Problems

+ Search
Add Entry AlertManage Folder Edit Entry Add page to http://del.icio.us/
Did You Find This Entry Useful?

24 of 34 people (71%) answered Yes
Recently 6 of 10 people (60%) answered Yes

Entry

What type of sort is used for the internal sort function in Python (quick, merge, insertion, etc.)?

Jun 13th, 2000 23:55
Rob Hooft, unknown unknown, Peter Schneider-Kamp, Tim Peters


The following assumes that we talk about lists in the standard (C) 
Python edition.

binary insertion sort is used for small lists (<100 elements) whereas
a quicksort derived sorting algorithm called "samplesort" is used for
larger lists. A description can be found in the list object 
implementation:

samplesort is basically quicksort on steroids:  a power of 2 close to 
n/ln(n) is computed, and that many elements (less 1) are picked at 
random from the array and sorted.  These 2**k - 1 elements are then used 
as preselected pivots for an equal number of quicksort partitioning 
steps, partitioning the slice into 2**k chunks each of size about ln(n). 
These small final chunks are then usually handled by binarysort.  Note 
that when k=1, this is roughly the same as an ordinary quicksort using a 
random pivot, and when k=2 this is roughly a median-of-3 quicksort.  
From that view, using k ~= lg(n/ln(n)) makes this a "median of n/ln(n)" 
quicksort.  You can also view it as a kind of bucket sort, where 2**k-1 
bucket boundaries are picked dynamically.

The large number of samples makes a quadratic-time case almost 
impossible, and asymptotically drives the average-case number of 
compares from quicksort's 2 N ln N (or 12/7 N ln N for the median-of-3 
variant) down to N lg N.

-----

The current binary_insertion/samplesort hybrid was introduced in Python
1.5.2.  In earlier releases a variety of plain quicksorts were used,
including-- in the very early days --libc's qsort.  The latter had to be
dropped because it wasn't reentrant on some platforms (== bad results 
and/or even segfaults), and ran horribly slowly on some platforms in 
some common degenerate cases.

The point to the current hybrid is that it minimizes comparisons while
(believe it or not, and it's hard to believe if you study the code!) 
doing much less data movement than a mergesort.  Comparisons are 
especially expensive in Python, because lists are heterogenous and each 
time a pair of elements gets compared the correct type-varying function 
for doing the comparison needs to be figured out and then indirectly 
invoked.  So minimizing compares is crucial for a Python sort.  I wasn't 
able to write a mergesort (which also does few compares) as fast as the 
hybrid, because mergesort shuffles data around a lot more -- there are 
few enough compares now that data movement expense is no longer lost in 
the shuffle <ahem>.  The current hybrid is also an in-place sort.  Its 
only drawbacks are the complexity of the algorithm and that the 
resulting sort isn't stable.

-----

The "sort" function in the Numeric python code still uses the libc's
qsort() function at this moment.