Chapter 10 Accessing Web APIs

You’ve used Python to load data from external files (either text files or locally-saved .csv files), but it is also possible to programmatically download data directly from web sites on the internet. This allows scripts to always work with the latest data available, performing analysis on data that may be changing rapidly (such as from social networks or other live events). Web services may make their data easily accessible to computer programs like Python scripts by offering an Application Programming Interface (API). A web service’s API specifies where and how particular data may be accessed, and many web services follow a particular style known as Representational State Transfer (REST). This chapter will cover how to access and work with data from these RESTful APIs.

10.1 Web APIs

An interface is the point at which two different systems meet and communicate: exchanging informations and instructions. An Application Programming Interface (API) thus represents a way of communicating with a computer application by writing a computer program (a set of formal instructions understandable by a machine). APIs commonly take the form of functions that can be called to give instructions to programs—the set of functions provided by a module like math or turtle make up the API for that module. While most APIs provide an interface for utilizing functionality, other APIs provide an interface for accessing data. One of the most common sources of these data apis are web services: websites that offer an interface for accessing their data.

  • (Technically the interface is just the function signature which says how you use the function: what name it has, what arguments it takes, and what value it returns. The actual module is an implementation of this interface).

With web services, the interface (the set of “functions” you can call to access the data) takes the form of HTTP Requests—that is, a request for data sent following the HyperText Transfer Protocol. This is the same protocol (way of communicating) used by your browser to view a web page! An HTTP Request represents a message that your computer sends to a web server (another computer on the internet which “serves”, or provides, information). That server, upon receiving the request, will determine what data to include in the response it sends back to the requesting computer. With a web browser the response data takes the form of HTML files that the browser can render as web pages; with data APIs the response data will be structured data that you can convert into structures such as lists or dictionaries.

In short, loading data from a Web API involves sending an HTTP Request to a server for a particular piece of data, and then receiving and parsing the response to that request.

10.2 RESTful Requests

There are two parts to a request sent to an API: the name of the resource (data) that you wish to access, and a verb indicating what you want to do with that resource. In many ways, the verb is the function you want to call on the API, and the resource is an argument to that function.

10.2.1 URIs

Which resource you want to access is specified with a Uniform Resource Identifier (URI). A URI is a generalization of a URL (Uniform Resource Locator)—what you commonly think of as “web addresses”. URIs act a lot like the address on a postal letter sent within a large organization such as a university: you indicate the business address as well as the department and the person, and will get a different response (and different data) from Alice in Accounting than from Sally in Sales.

  • Note that the URI is the identifier (think: variable name) for the resource, while the resource is the actual data value that you want to access.

Like postal letter addresses, URIs have a very specific format used to direct the request to the right resource.

The format (schema) of a URI.

Not all parts of the format are required—for example, you don’t need a port, query, or fragment. Important parts of the format include:

  • scheme (protocol): the “language” that the computer will use to communicate the request to this resource. With web services this is normally https (secure HTTP)
  • domain: the address of the web server to request information from
  • path: which resource on that web server you wish to access. This may be the name of a file with an extension if you’re trying to access a particular file, but with web services it often just looks like a folder path!
  • query: extra parameters (arguments) about what resource to access.

The domain and path usually specify the resource. For example, www.domain.com/users might be an identifier for a resource which is a list of users. Note that web services can also have “subresources” by adding extra pieces to the path: www.domain.com/users/joel might refer to the specific “joel” user in that list.

With an API, the domain and path are often viewed as being broken up into two parts:

  • The Base URI is the domain and part of the path that is included on all resources. It acts as the “root” for any particular resource. For example, the GitHub API has a base URI of https://api.github.com/.

  • An Endpoint, or which resource on that domain you want to access. Each API will have many different endpoints.

    Note that most resources include multiple subresource endpoints. For example, you can access information about a specific user at the endpoint /users/:username—the colon : indicates that the subresource name is a variable—you can replace that word in the endpoint with whatever string you want. Thus if you were interested in the GitHub user joelwross, you would access the /users/joelwross endpoint.

    Subresources may have further subresources (which may or may not have variable names). The endpoint /orgs/:org/repos refers to the list of repositories belonging to an organization.

    Variable names in resources can also be written inside of curly braces {}; for example, /orgs/{org}/repos. Neither the colon nor the braces is programming language syntax, just common conventions used to communicate endpoints.

The endpoint is appended to the end of the base URI, so you could access a GitHub user by combining the base URI (https://api.github.com) and endpoint (/users/joelwross) into a single string: https://api.github.com/users/joelwross. That URL will return a data structure of information about the GitHub user, which you can request from a Python program or simply view in your web-browser. Thus you can equivalently talk about accessing a particular resource and sending a request to a particular endpoint.

10.2.1.1 Query Parameters

Often in order to access only partial sets of data from a resource (e.g., to only get some users) you also include a set of query parameters. These are like extra arguments that are given to the request function. Query parameters are listed after a question mark ? in the URI, and are formed as key-value pairs similar to how you named items in lists. The key (parameter name) is listed first, followed by an equal sign =, followed by the value (parameter value); note that you can’t include any spaces in URIs! You can include multiple query parameters by putting an ampersand & between each key-value pair:

?firstParam=firstValue&secondParam=secondValue&thirdParam=thirdValue

Exactly what parameter names you need to include (and what are legal values to assign to that name) depends on the particular web service. Common examples include having parameters named q or query for searching, with a value being whatever term you want to search for: in https://www.google.com/search?q=informatics, the resource at the /search endpoint takes a query parameter q with the term you want to search for!

10.2.1.2 Access Tokens and API Keys

Many web services require you to register with them in order to send them requests. This allows them to limit access to the data, as well as to keep track of who is asking for what data (usually so that if someone starts “spamming” the service, they can be blocked).

To facilitate this tracking, many services provide Access Tokens (also called API Keys). These are unique strings of letters and numbers that identify a particular developer (like a secret password that only works for you). Web services will require you to include your access token as a query parameter in the request; the exact name of the parameter varies, but it often looks like access_token or api_key. When exploring a web service, keep an eye out for whether they require such tokens.

Access tokens act a lot like passwords; you will want to keep them secret and not share them with others. This means that you should not include them in your committed files, so that the passwords don’t get pushed to GitHub and shared with the world. The best way to do this in Python is to create a separate script file in your repo (e.g., apikeys.py) which includes exactly one line: assigning the key to a variable:

## in `apikeys.py`
my_api_key = "123456789abcdefg"

You can then include this filename in a .gitignore file in your repo; that will keep it from even possibly being committed with your code!

In order to access this variable in your “main” script, you can import this file as a module. The module is the name of the file, and you can access specific variables from it to include them in the “global” scope. Note that importing a file as a module will execute each line of code in that module (that isn’t in a “main” block):

## in `my_script.py`
from apikeys import my_api_key

print(my_api_key)  # key is now available!

Note that this assumes the apikeys.py file is inside the same folder as the script being run. See the documentation for details on how to handle other options.

Anyone else who runs the script will simply need to provide a my_api_key variable to access the API using their key, keeping everyone’s accounts separate and private!

Watch out for APIs that mention using OAuth when explaining API keys. OAuth is a system for performing authentification—that is, letting you or a user log into a website from your application (like what a “Log in with Facebook” button does). OAuth systems require more than one access key, and these keys must be kept secret and usually require you to run a web server to utilize them correctly (which requires lots of extra setup, see requests_oauthlib library for details). So for this book, we encourage you to avoid anything that needs OAuth

10.2.2 HTTP Verbs

When you send a request to a particular resource, you need to indicate what you want to do with that resource. When you load web content, you are typically sending a request to retrieve information (logically, this is a GET request). However, there are other actions you can perform to modify the data structure on the server. This is done by specifying an HTTP Verb in the request. The HTTP protocol supports the following verbs:

  • GET Return a representation of the current state of the resource
  • POST Add a new subresource (e.g., insert a record)
  • PUT Update the resource to have a new state
  • PATCH Update a portion of the resource’s state
  • DELETE Remove the resource
  • OPTIONS Return the set of methods that can be performed on the resource

By far the most common verb is GET, which is used to “get” (download) data from a web service. Depending on how you connect to your API (i.e., which programming language you are using), you’ll specify the verb of interest to indicate what we want to do to a particular resource.

Overall, this structure of treating each datum on the web as a resource which we can interact with via HTTP Requests is referred to as the REST Architecture (REST stands for REpresentational State Transfer). This is a standard way of structuring computer applications that allows them to be interacted with in the same way as everyday websites. Thus a web service that enabled data access through named resources and responds to HTTP requests is known as a RESTful service, with a RESTful API.

10.3 Accessing Web APIs

To access a Web API, you just need to send an HTTP Request to a particular URI! You can easily do this with the browser: simple navigate to a particular address (base URI + endpoint), and that will cause the browser to send a GET request and display the resulting data in the browser. For example, you can send a request to search GitHub for repositories named d3 by visiting:

https://api.github.com/search/repositories?q=d3&sort=stars

This query accesses the /search/repositories/ endpoint, and also specifies 2 query parameters:

  • q: The term(s) you are searching for, and
  • sort: The attribute of each repository that you would like to use to sort the results

(Note that the data you’ll get back is structured in JSON format. See below for details).

In Python you can most easily send GET requests using the requests module. This is a third-party library (not built into Python!), but is installed by default with Anaconda and so can just be imported:

import requests

This library provides a number of functions that reflect HTTP verbs. For example, the get() function will send an HTTP GET Request to the specified URI:

response = requests.get("https://api.github.com/search/repositories?q=d3&sort=stars")

While it is possible to include query parameters in the URI, requests also allows you to include them as a dictionary, making it easy to set and change variables (instead of needing to do complex String concatenation):

resource_uri = "https://api.github.com/search/repositories"
query_params = {'q': 'd3', 'sort': 'stars'}
response = requests.get(resource_uri, params = query_params)

Pro tip: You can access the response.url property to see the URI where a request was sent to. This is useful for debugging: print out the URI that you constructed to paste it into your browser!

If you try printing out the response variable, you’ll just see a simple output: <Response [200]>. The 200 is a an HTTP Status Code: an integer giving details about how the request went (200 means “OK” the request was answered successfully. Other common codes are 404 “Not Found” if the resource isn’t there, or 403 “Forbidden” if you don’t have permission to access the resource, such as if your API key isn’t specified correctly).

HTTP Status Codes are part of the response header. Each response has two parts: the header, and the body. You can think of the response as a envelope: the header contains meta-data like the address and postage date, while the body contains the actual contents of the letter (the data).

  • You can view all of the headers in the response.headers property.

Since you’re almost always interested in working with the body, you will need to extract that data from the response (e.g., open up the envelope and pull out the letter). You can get this content as the text property:

# extract content from response, as a text string
body = response.text

10.4 JSON Data

If you print out this text, you’ll see what looks like an oddly-formatted dictionary. This is because most APIs will return data in JavaScript Object Notation (JSON) format. Like CSV, this is a format for writing down structured data—but while .csv files organize data into rows and columns, JSON allows you to organize elements into sequences and key-value pairs—similar to Python lists and dictionaries!

JSON format looks a lot like the literal format for Python lists and dictionaries. For example, consider the below JSON notation for Programmer Ada:

{
  "first_name": "Ada",
  "job": "Programmer",
  "salary": 78000,
  "in_union": true,
  "pets": ["rover", "fluffy", "mittens"],
  "favorites": {
    "music": "jazz",
    "food": "pizza",
    "numbers": [12, 42]
  }
}

JSON uses curly braces {} to define sets of key-value pairs (called objects, not “dictionaries”), and square brackets [] to define ordered sequences (called arrays, not “lists”). Like in Python, keys and values are separated with colons (:), and elements are separated by commas (,). As in Python, key-value pairs are often written on separate lines for readability, but this isn’t required.

Note that unlike in Python dictionaries, keys must be strings (so in quotes), while values can either be strings, numbers, booleans (written in all lower-case as true and false), arrays (list), or other objects (dictionaries). Thus JSON can be seen as a way of writing normal Python dictionaries (or lists of dictionaries), just with a few restrictions on key and value type.

Pro-tip: JSON data can be quite messy when viewed in your web-browser. Installing a browser extension such as JSONView will format JSON responses in a more readable way, and even enable you to interactively explore the data structure.

JSON response data provided by the requests library can automatically be converted into the Python lists and dictionaries you are used to by using the .json() function:

resource_uri = "https://api.github.com/search/repositories"
query_params = {'q': 'd3', 'sort': 'stars'}
response = requests.get(resource_uri, params = query_params)
data = response.json()
type(data)  # <class 'dict'>
            # may also be <class 'list'> for some resources

Note that it is also possible to use Python to convert from (and to!) a “JSON String” into standard data types like list or dictionaries using the json module:

import json

# Convert to dictionary or list
data = json.loads(json_string)

# Convert to json
my_data = {'a':1, 'b':2}
my_json = json.dumps(my_data)

Note that these method names end in s: they are the “load string” and “dump string” methods (load and dump are for files!)

In practice, web APIs often return highly nested JSON objects (lots of dictionaries in dictionaries); you may need to go a “few levels deep” in order to access the data you really care about. Use the keys() dictionary method (or your web browser!) to inspect the returned data structure and find the values you are interested in. Note that it can be useful to “flatten” the data into a simpler structure for further access.

For example:

resource_uri = "https://api.github.com/search/repositories"
query_params = {'q': 'd3', 'sort': 'stars'}
response = requests.get(resource_uri, params = query_params)
data = response.json()
print(data.keys())  # dict_keys(['total_count', 'incomplete_results', 'items'])
    # looking at the JSON data itself (e.g., in the browser), `items` is the key
    # that contains the value you want

result_items = data['items']  # "flatten" into new variable
print(len(result_items))  # 30, number of results returned