
03.02 Data Indexing and Selection

In Chapter 2, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays. These included indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]). Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
We'll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.

Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
python
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
python
data['b']
0.5
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
python
'a' in data
True
python
data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
python
list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
python
data['e'] = 1.25
data
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:
python
# slicing by explicit index
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64
python
# slicing by implicit integer index
data[0:2]
a    0.25
b    0.50
dtype: float64
python
# masking
data[(data > 0.3) & (data < 0.8)]
b    0.50
c    0.75
dtype: float64
python
# fancy indexing
data[['a', 'e']]
a    0.25
e    1.25
dtype: float64
Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.
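As a minimal sketch of this difference, the following rebuilds the five-element Series from above and compares the two slice styles. It uses the loc and iloc indexers (covered in the next section) so that each slice states its intent explicitly; the names by_label and by_position are illustrative only.

```python
import pandas as pd

# Same values as the running example, including the appended 'e' entry
data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the stop label 'c' is included
by_label = data.loc['a':'c']

# Position-based slice: the stop position 2 is excluded
by_position = data.iloc[0:2]

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```

The label-based slice returns three elements and the position-based slice returns two, even though both look like "start to stop" expressions.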

Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
python
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
python
# explicit index when indexing
data[1]
'a'
python
# implicit index when slicing
data[1:3]
3    b
5    c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
First, the loc attribute allows indexing and slicing that always references the explicit index:
python
data.loc[1]
'a'
python
data.loc[1:3]
1    a
3    b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
python
data.iloc[1]
'b'
python
data.iloc[1:3]
3    b
5    c
dtype: object
A third indexing attribute, ix, was a hybrid of the two, and for Series objects was equivalent to standard []-based indexing. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so it is not available in current releases; its purpose is easiest to see in the context of DataFrame objects, which we will discuss in a moment.
One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code. Especially in the case of integer indexes, I recommend using them, both to make code easier to read and understand and to prevent subtle bugs due to the mixed indexing/slicing convention.
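A minimal sketch of this recommendation, using the integer-indexed Series from above: when a label and a position need to be combined, one option is to translate the label into a position with Index.get_loc and then slice with iloc. The variable names (by_label, by_position, start, window) are illustrative, and the approach assumes the index labels are unique so that get_loc returns a single integer.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Label-based: the element whose index label is 3
by_label = data.loc[3]

# Position-based: the element at position 2 (the third element)
by_position = data.iloc[2]

# Mixing the two explicitly: find the position of label 3,
# then take that element and the one after it by position
start = data.index.get_loc(3)
window = data.iloc[start:start + 2]

print(by_label)         # b
print(by_position)      # c
print(window.tolist())  # ['b', 'c']
```

Each step states whether it works on labels or on positions, which is the property that makes loc and iloc easier to reason about than plain [] indexing on an integer index.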

Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let's return to our example of areas and populations of states:
python
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
python
data['area']
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that are strings:
python
data.area
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
This attribute-style column access actually accesses the exact same object as the dictionary-style access:
python
data.area is data['area']
True
Though this is a useful shorthand, keep in mind that it does not work in all cases! If the column names are not strings, or if they conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:
python
data.pop is data['pop']
False
In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).
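The following sketch illustrates the name clash on a two-state subset of the data above (the smaller frame is only for brevity). It checks what each access style returns and then replaces the column using dictionary-style assignment; doubling the population is an arbitrary operation chosen for illustration.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area': area, 'pop': pop})

# 'area' does not clash with any DataFrame attribute,
# so both access styles return the same values
same_values = data.area.equals(data['area'])

# 'pop' clashes with DataFrame.pop, so attribute access gives the method
attribute_is_method = callable(data.pop)
item_is_series = isinstance(data['pop'], pd.Series)

# Dictionary-style assignment reliably replaces the column
data['pop'] = data['pop'] * 2

print(same_values, attribute_is_method, item_is_series)  # True True True
print(data['pop'].tolist())  # [76665042, 52896386]
```

Using data['pop'] for both reading and writing avoids having to remember which column names happen to collide with DataFrame attributes.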
As with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
python
data['density'] = data['pop'] / data['area']
data
This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects; we'll dig into this further in Operating on Data in Pandas.

DataFrame as two-dimensional array

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:
python
data.values
array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])
With this picture in mind, many familiar array-like operations can be performed on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:
python
data.T
When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:
python
data.values[0]
array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
and passing a single "index" to a DataFrame accesses a column:
python
data['area']
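The contrast between the two access styles can be sketched on a small frame. The two-row DataFrame below is a reduced version of the example data, built directly from lists for brevity, and the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# A single index into the underlying NumPy array selects a row
first_row = data.values[0]
print(first_row.tolist())    # [423967, 38332521]

# A single key into the DataFrame selects a column
area_column = data['area']
print(area_column.tolist())  # [423967, 695662]

# Transposing swaps the roles of rows and columns
transposed = data.T
print(list(transposed.index))    # ['area', 'pop']
print(list(transposed.columns))  # ['California', 'Texas']
```

The same syntax, one value in square brackets, therefore means "row 0" for the array and "column 'area'" for the DataFrame.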
