Chapter 7 Searching and Filtering

One of the most common and important uses of computer programs is to search a large set of data for a specific record or observation—indeed, a significant portion of what computers do is search through data. This chapter covers a number of common patterns and algorithms used when searching through lists of data in order to answer questions about that data. Note that this chapter introduces no new syntax, but instead provides a deeper look at using loops and lists. The first two sections illustrates common interactions with lists, while the last considers at a high level the efficiency of algorithms.

7.2 Filtering

In addition to searching data for a specific item, you will often want to search a data set for all items that meet some criteria. For example, you want to find all items numbers that are large, or all sports games that the home team one, or all outliers in a medical diagnoses. This is called filtering for a set of data. Filtering is one of the most common steps used in data analysis; indeed the searching described in the previous section can be seen as a specialized filtering (where you’re filtering for a single element).

Importantly, when we talk about filtering in programming, we think about what items we want to select or keep, rather than what items we want to discard. We don’t try to “filter out the small numbers”, instead we try to “filter (select) for the big numbers”. A filter is about what is let in (a whitelist), not about what is excluded.

The simplest way to filter a data set is to use a loop similar to a linear search, but instead of finding a single element, you construct a new list that contains the elements that “pass” the filter and thus that you want to keep. This follows the basic template:

result_list = [] # initialize new list
for value in original_sequence: # loop through the items
    if value meets criteria: # if item "passes" the filter and should be kept
        result_list.append(value) # add value to new list

As a concrete example, you can filter a list of numbers for those that are “large” (e.g., greater than 10):

numbers = [17, 18, 3, 7, 11, 16, 13, 4] # the list to search

large_numbers = [] # the list for the items to find
for number in numbers: # loop through the values
    if(number > 10): # check if the value
        large_numbers.append(number)

print(large_numbers) # [17, 18, 11, 16, 13]

There are two important things to notice in this example.

First, filtering involves constructing a new list or sequence. This makes the code easier to write and process (you don’t need to worry about deleting things from the middle of a list), and is best practice in doing data analysis because you won’t modify or “lose” your original data, so that you can perform different filters on it (e.g., you could also filter for which numbers are even).

Second, the if statement’s expression checks if the value meets the desired criteria and should be added to the new list, not whether it should be skipped or removed. Filtering is a “positive” process. The filter criteria (the boolean expression) can be as complex as needed; you can check multiple criteria by using the and or or operators. And of course you can include additional statements or if statements if that helps make your code more understandable:

i_words = []
for word in word_list:
    lower_case_word = word.lower() # format for consideration
    if(lower_case_word.startswith('i')): # check starting letter
        if(not lower_case_word in i_words): # don't keep if already in list
            chosen_words.append(word) # keep original word

Of course, to use a more Pythonic approach you can do filtering using a list comprehension:

numbers = [17, 18, 3, 7, 11, 16, 13, 4] # the list to search
large_nubmers = [number for number in numbers if number > 0]
print(large_numbers) # [17, 18, 11, 16, 13]

Using an if statement in a list comprehensino performs a filtering operation!

Note that The i_words example cannot be written as a list comprehension, because the filtering work is dependent on the previous results—it is not considered a “pure” operation.

7.3 Mapping

You can’t talk about filtering without also mentioning its cousin, mapping. A mapping operation is one that takes an original list (e.g., of numbers) and produces a new list with each of the original elements transformed in a certain way (e.g., rounded). The operation “maps” each number into a new value in the new list. This is a common operation to apply: maybe you want to “transform” a list so that all the values are rounded or lowercase, or you want to map a list of words to a list of their lengths.

Mapping can be done using a similar looping process to filtering, but you add a transformed value to the new list (instead of the original value if it meets a filtering criteria):

numbers = [1.3, 2.7, 3.14, 4.8, 5.99] # the list to map
rounded_numbers = []  # a new list for the mapped values
for number in numbers: # loop through the original list
    transformed = round(number) # map the individual item
    rounded_numbers.append(transformed) # add transformed value to new list.

print(rounded_numbers)  # [1, 3, 3, 5, 6]

You might recognize this example—this kind of process is exactly what a list comprehension does! In fact, a list comprehension is a mapping operation—every comprehension is doing some kind of mapping (and likely some filtering at the same time):

numbers = [1.3, 2.7, 3.14, 4.8, 5.99] # the list to map
rounded_numbers = [round(number) for number in numbers]
print(rounded_numbers)  # [1, 3, 3, 5, 6]

7.4 Search Efficiency

Searching is something that you do a lot (and on bigger and bigger data sets!), so it’s worth considering: how fast is this process? How efficient is the algorithm? Is there possibly a more efficient way of searching?

Warning: AVOID PRE-MATURE OPTIMIZATION!! While this section and chapter discuss the “speed” of computer programs, you should avoid spending too much (or any!) time trying to make your program “as fast as possible”. The first step in any computer program is to make it function at all. Once you have it working, then you can worry about increasing the efficiency—and only if it is currently too slow for your purposes. Pre-mature optimization (trying to make it work fast before you make it work at all) is a major source of bugs and other problems.

One way of measuring the speed of an algorithm would be to time it, such as by using a stopwatch. Python includes modules that can be used to record the time, allowing you to get the “start” and “stop” time of the algorithm, and then calculate the elapsed duration. However, this wall-clock efficiency is highly dependent on the exact list being searched and on the computer that is executing the algorithm—if you’re also streaming videos while searching, the algorithm may run slower!

Instead, computer scientists measure the efficiency of a program by counting the number of operations that the algorithm does. This number will be independent of the data and machine, and so makes it easier to compare approaches. Specifically, to measure the efficiency of a search, you would consider the number of comparisons that need to be made between elements (e.g., how many elements you check before you find what you’re looking for), under the assumption that this is the most time-consuming part of the computer’s search algorithm.

7.4.1 Linear Search Speed

Linear search involves looking at each item in the list one at a time, so the number of “checks” that need to be made is dependent on the size of the list. And because the item you’re looking for may be either be at the beginning of the list (meaning you don’t search for very long) or at the end, you can consider both the “average” case (when it’s in the middle), as well as the “worst” case (when it’s at the end or not in the list at all!):

len(list) # comparisons (avg) # comparisons (worst)
10 5 10
20 10 20
50 25 50
100 50 100
1000 500 1000
N N/2 N

These numbers should be somewhat intuitive: because you’re looking at each element in a line, in the average case you need to look at half of the elements, and in the worst case you need to look at all of them! As such, the number of comparisons you need to make (and thus the efficiency of the algorithm) is a linear function of the size of the list:

Linear time complexity. Image by Nick Salloum.

This is in fact why it is called a linear search!

In general, we measure algorithm efficiency (or more properly, algorithmic complexity) in terms of the rate of change in the speed: that is, if you double the size of the input list, by what ratio does the work you need to do increase? With a linear search, doubling the size of the list will double the amount of work to do.

Note that looking up an element by its index or key is a constant function—it takes the same amount of time no matter how big the list or dictionary is. This is part of why dictionaries are so useful as look-up tables: you don’t need to spend time searching for the value if it you have its key!

7.4.3 Slower Algorithms: Sorting

There are many, many different algorithms for sorting numbers, many of which have amusing names given to them by computer scientists (Python’s built-in sorted() method uses one called Timsort, named after the man who invented it). This chapter will discuss just one straightforward example for illustration purposes.

One such algorithm (Selection Sort) utilizes the “king-of-the-hill” search described above. This algorithm works as follows: search for (“select”) the smallest item in the list. Because it is the smallest, it must be the first item in the sorted list, and can be placed there. Next, select the second-smallest item in the list (the smallest of the “unsorted” items), and place that second in the “sorted” list. Continue with this process until the entire list is sorted!

def selection_sort(a_list):
    """Sorts the list (in place)"""

    for i in range(len(a_list)):  # go through each spot in the list
        # Do a "king-of-the-hill" search of the remaining items
        selected_index = i
        for j in range(i, len(a_list)):
            if(a_list[j] < a_list[selected_index]):
                selected_index = j

        # swap smallest into place (multi-assignment!)
        a_list[i], a_list[selected_index] = a_list[selected_index], a_list[i]

To determine the speed of the selection sort algorithm, notice that the first time through the loop requires considering N different items (the whole loop). The second time requires checking N-1 items, the third time N-2 items, and so forth until the last time through the loop you only need to compare 2 then 1 items. These checks can then be summed into a series:

 

Selection sort’s speed is a quadratic function of the size of the list: that is, if you double the size of the list, the number of comparisons you need to do quadruples!.

Faster sorting algorithms (like Python’s Timsort) get this speed down to “fast” N*log2(N) (“loglinear”), which is much better than quadratic algorithms but still notably slower than linear algorithms.

For large lists, sorting a list is slower than just using a linear search on it… but once that list is sorted, you can use the ultra-fast binary search! This is a tradeoff that needs to be considered when trying to improve the efficiency of programs that work on large data sets: sometimes you need to spend some extra time up front to prepare (sort) the data, in order to be able to utilize (search) it more effectively.