Chapter 7 Searching and Efficiency

One of the most common and important uses of computer programs is to search a large set of data for a specific record or observation—indeed, a significant portion of what computers do is search through data. This chapter covers a number of common patterns and algorithms used when searching through lists of data in order to answer questions about that data. Note that this chapter introduces no new syntax, but instead provides a deeper look at using loops and lists. The first section illustrates common interactions with lists, while the remaining consider at a high level the efficiency of algorithms.

7.2 Linear Search Speed

Searching is something that you do a lot (and on bigger and bigger data sets!), so it’s worth considering: how fast is this process? How efficient is the algorithm? Is there possibly a more efficient way of searching?

Warning: AVOID PRE-MATURE OPTIMIZATION!! While this section and chapter discuss the “speed” of computer programs, you should avoid spending too much (or any!) time trying to make your program “as fast as possible”. The first step in any computer program is to make it function at all. Once you have it working, then you can worry about increasing the efficiency—and only if it is currently too slow for your purposes. Pre-mature optimization (trying to make it work fast before you make it work at all) is a major source of bugs and other problems.

One way of measuring the speed of an algorithm would be to time it, such as by using a stopwatch. Python includes modules that can be used to record the time, allowing you to get the “start” and “stop” time of the algorithm, and then calculate the elapsed duration. However, this wall-clock efficiency is highly dependent on the exact list being searched and on the computer that is executing the algorithm—if you’re also streaming videos while searching, the algorithm may run slower!

Instead, computer scientists measure the efficiency of a program by counting the number of operations that the algorithm does. This number will be independent of the data and machine, and so makes it easier to compare approaches. Specifically, to measure the efficiency of a search, you would consider the number of comparisons that need to be made between elements (e.g., how many elements you check before you find what you’re looking for), under the assumption that this is the most time-consuming part of the computer’s search algorithm.

Linear search involves looking at each item in the list one at a time, so the number of “checks” that need to be made is dependent on the size of the list. And because the item you’re looking for may be either be at the beginning of the list (meaning you don’t search for very long) or at the end, you can consider both the “average” case (when it’s in the middle), as well as the “worst” case (when it’s at the end or not in the list at all!):

len(list) # comparisons (avg) # comparisons (worst)
10 5 10
20 10 20
50 25 50
100 50 100
1000 500 1000
N N/2 N

These numbers should be somewhat intuitive: because you’re looking at each element in a line, in the average case you need to look at half of the elements, and in the worst case you need to look at all of them! As such, the number of comparisons you need to make (and thus the efficiency of the algorithm) is a linear function of the size of the list:

Linear time complexity. Image by Nick Salloum.

This is in fact why it is called a linear search!

In general, we measure algorithm efficiency (or more properly, algorithmic complexity) in terms of the rate of change in the speed: that is, if you double the size of the input list, by what ratio does the work you need to do increase? With a linear search, doubling the size of the list will double the amount of work to do.

Note that looking up an element by its index or key is a constant function—it takes the same amount of time no matter how big the list or dictionary is. This is part of why dictionaries are so useful as look-up tables: you don’t need to spend time searching for the value if it you have its key!

7.4 Sorting

There are many, many different algorithms for sorting numbers, many of which have amusing names given to them by computer scientists (Python’s built-in sorted() method uses one called Timsort, named after the man who invented it). This chapter will discuss just one straightforward example for illustration purposes.

One such algorithm (Selection Sort) utilizes the “king-of-the-hill” search described above. This algorithm works as follows: search for (“select”) the smallest item in the list. Because it is the smallest, it must be the first item in the sorted list, and can be placed there. Next, select the second-smallest item in the list (the smallest of the “unsorted” items), and place that second in the “sorted” list. Continue with this process until the entire list is sorted!

def selection_sort(a_list):
    """Sorts the list (in place)"""

    for i in range(len(a_list)):  # go through each spot in the list
        # Do a "king-of-the-hill" search of the remaining items
        selected_index = i
        for j in range(i, len(a_list)):
            if(a_list[j] < a_list[selected_index]):
                selected_index = j

        # swap smallest into place (multi-assignment!)
        a_list[i], a_list[selected_index] = a_list[selected_index], a_list[i]

To determine the speed of the selection sort algorithm, notice that the first time through the loop requires considering N different items (the whole loop). The second time requires checking N-1 items, the third time N-2 items, and so forth until the last time through the loop you only need to compare 2 then 1 items. These checks can then be summed into a series:

 

Selection sort’s speed is a quadratic function of the size of the list: that is, if you double the size of the list, the number of comparisons you need to do quadruples!.

Faster sorting algorithms (like Python’s Timsort) get this speed down to “fast” N*log2(N) (“loglinear”), which is much better than quadratic algorithms but still notably slower than linear algorithms.

For large lists, sorting a list is slower than just using a linear search on it… but once that list is sorted, you can use the ultra-fast binary search! This is a tradeoff that needs to be considered when trying to improve the efficiency of programs that work on large data sets: sometimes you need to spend some extra time up front to prepare (sort) the data, in order to be able to utilize (search) it more effectively.