How to Select Rows From a DataFrame Based on Column Values – Finxter

0
79
shubham finxter profile image


Problem Statement: Selecting rows from a Dataframe based on the column values.

Introduction

DataFrame (A Quick Recap)

A DataFrame is a 2-dimensional data structure that is generally immutable and heterogeneous. It has labelled axes – rows and columns. A Pandas Dataframe comprises three parts: data, rows, and columns.

Now let’s create a Dataframe. We are going to use the following Dataframe throughout the article.

import pandas as pd
df = pd.DataFrame(
 [
  (5, 10, 15, 20, 'x'),
  (1, 2, 3, 4, 'y'),
  (40, 50, 60, 70, 'z'),
 ],
 columns = ['A', 'B', 'C', 'D', 'E'],
)
print(df)
    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z

Now that we have successfully created a Pandas Dataframe is let’s dive into the different methods that will demonstrate how to select rows from the DataFrame based on column values.

Note: We have selected numerous scenarios to discuss how we can select rows based on a variety of column conditions. Please follow along.

Scenario 1: Select Rows with Column Values Equal To A String Or A Scalar

When the column value is equal to a scalar or a string, we can use the loc() attribute to select only those rows which are equal to a specific value in a particular column.

Example 1: In the following code, we will select only those rows which have a specific value in a particular column.

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
print("Row with value 'z':")
val = df.loc[df['E'].str.contains('z')]
print(val)

Output:

    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z
Row with value 'z':
    A   B   C   D  E
2  40  50  60  70  z

Discussion: In the above example, loc() enabled us to select only that the row where the column E has the value Z present in the row with the help of the str.contains() method.

Example 2: In the next example we will use the == operator to select the rows for which the column value is equal to a scalar.

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
print("Row with value '3':")
val = df.loc[df['C'] == 3]
print(val)

Output:

    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z
Row with value '3':
   A  B  C  D  E
1  1  2  3  4  y

✨ We can even provide the boolean condition while indexing these Dataframes and hence omit the whole loc attribute. Look at the following example:

Example:

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
print("Row with value '50':")
print(df[df['B'] == 50])

Output:

    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z
Row with value '50':
    A   B   C   D  E
2  40  50  60  70  z

Note:  Instead of using df['B'], the column B can also be referenced using df.B.

Example:

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
print("Row with value '4':")
mask = df.B == 50
print(df[mask])

Output:

    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z
Row with value '4':
    A   B   C   D  E
2  40  50  60  70  z

Discussion: Generally, coders name and store such boolean conditions in a variable named mask which can then be passed to DataFrame while indexing it. However, it is advisable to use the loc() attribute when selecting the rows where the column value is equal to scalar or string. Using loc() method is a faster approach and also df[mask] will eventually be evaluated as df.loc[mask]. Thus, using loc directly saves time.

Scenario 2: When The Column Value Is Not Equal To A Scalar

We have to use the not operator (!) when we need to select a row where the column value is not equal to the scalar.

Example: In the following example we will select the rows where the column value is not equal to 1.

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
print("Rows without the value '1':")
val = df.loc[df['A'] != 1]
print(val)

Output:

    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z
Rows without the value '1':
    A   B   C   D  E
0   5  10  15  20  x
2  40  50  60  70  z

Scenario 3: Dropping Rows With “None” Column Values

We can use the df.dropna() method to drop all the rows with column values equal to None.

Example:

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (None, 75, 85, 95, 'a'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
print("Rows without None values:")
print(df.dropna())

Output:

      A   B   C   D  E
0   5.0  10  15  20  x
1   NaN  75  85  95  a
2   1.0   2   3   4  y
3  40.0  50  60  70  z
Rows without None values:
      A   B   C   D  E
0   5.0  10  15  20  x
2   1.0   2   3   4  y
3  40.0  50  60  70  z

Scenario 4: Select Rows With Multiple Column Conditions

To select the rows based on multiple column conditions, we have to use <= and >= operators. Due to Python’s operator precedence rules, always ensure that you use the parentheses for each conditional expression. 

Example:

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df)
val = df.loc[(df['C'] >= 59) & (df['D'] <= 72)]
print(val)

Output:

    A   B   C   D  E
0   5  10  15  20  x
1   1   2   3   4  y
2  40  50  60  70  z
    A   B   C   D  E
2  40  50  60  70  z

Scenario 5: Select Rows With Column Value In An Iterable

In the following example we will learn how to select rows for which a particular value is contained within a list. To implement this, you just have to use the isin() method as shown in the following example.

Example: Select all rows where the C column has 15 or 60.

import pandas as pd

df = pd.DataFrame(
    [
        (5, 10, 15, 20, 'x'),
        (1, 2, 3, 4, 'y'),
        (40, 50, 60, 70, 'z'),
    ],
    columns=['A', 'B', 'C', 'D', 'E'],
)
print(df.loc[df['C'].isin([15, 60])])

Output:

    A   B   C   D  E
0   5  10  15  20  x
2  40  50  60  70  z

Conclusion

In this article, we learned different ways to select rows from a DataFrame based on column values. I hope this article has helped you. Please stay tuned and do subscribe for more such articles.



Source link

Leave a reply

Please enter your comment!
Please enter your name here