
Review of whole repo#4

Open
rmccreath wants to merge 21 commits into review/base from review/changes

Conversation

@rmccreath
Member

TBC



To open a new Jupyter Notebook, click the "New" drop down menu on the Files tab and select “Python 3.10.2” under the notebooks heading. This will open a blank Notebook with an IPython console running underneath it.
Member Author

This version of Python isn't available on the server anymore. It might be better to include details about Python versions in the 'Foundation' section, then in the IDE instructions, give details about what should be selected at a higher level rather than very specific (helps users know why, avoids the contents being outdated at the next update).

Comment on lines +815 to +841
### Row Indexes

By default, DataFrames come with a **RangeIndex** where the first row is labelled `0`, the second row is labelled `1`, and so on. We can call the `.index` attribute on a DataFrame:

```{python call-index, eval = FALSE, echo = TRUE}
df.index
```

If the DataFrame has the default index, the output will be

```{python index-output, eval = FALSE, echo = TRUE}
RangeIndex(start=0, stop=number of rows, step=1)
```

indicating that the row labels

- start at 0
- increase by 1 for each row
- end *before* reaching the total number of rows (since we started counting at 0)

For example, a DataFrame with 3 rows might have the index `RangeIndex(start=0, stop=3, step=1)`.

In this RangeIndex, we start labelling rows at 0, stop before 3, and increase by 1 each row:

- the first row has the label 0
- the second row has the label 1
- the third row has the label 2
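As a self-contained sketch (a made-up 3-row DataFrame, not the course data):

```python
import pandas as pd

# Hypothetical 3-row DataFrame; any 3-row frame gets the same default index
df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cat'], 'age': [34, 29, 41]})

print(df.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(df.index))  # [0, 1, 2]
```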
Member Author

The RangeIndex is a component of a Pandas object; a DataFrame will use it by default, accessed via DataFrame.index. This feels like a lot of detail to describe indexing on an introduction course, I'm not sure there's a benefit?

Contributor

I think indexing concept is used for iloc and loc sections later on. I'll leave this section here for now and we can discuss later.

Comment on lines +866 to +870
```{python sort-value2, exercise = TRUE, exercise.setup = "pandas-setup"}
borders.sort_values(
by = 'HospitalCode'
)
```
Member Author

The inplace default is False so this won't update the data? It's probably important to mention that.
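For instance, a minimal sketch with made-up data (not the course's `borders` CSV):

```python
import pandas as pd

# Toy stand-in for the borders data (values assumed)
df = pd.DataFrame({'HospitalCode': ['B', 'A', 'C']})

df.sort_values(by='HospitalCode')       # returns a sorted *copy*; df is unchanged
print(list(df['HospitalCode']))         # ['B', 'A', 'C']

df = df.sort_values(by='HospitalCode')  # reassigning keeps the sorted result
print(list(df['HospitalCode']))         # ['A', 'B', 'C']
```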


Sometimes, we are only interested in looking at certain rows and columns of a DataFrame. We can select sections of a DataFrame using two methods: `.loc[]` and `.iloc[]`.

We will explore how both of these work using a mini-DataFrame from `borders` data.
Member Author

Why?

mini_df.sort_values(by = 'Main_Condition')
```

Note that the row index is now out-of-order. This will be important later on.
Member Author

Typo: ?


print(hello_world)
```

Member Author

As things are quite different in Python, understanding how to interact with dataframes, like accessing specific columns, would be good to know. This is missing at the moment, but could be included here or in the 'Explore' section.

Comment on lines +898 to +916
#### Using `.loc[]`

The `.loc[]` index works using the index labels:

```{python loc, eval = FALSE, echo = TRUE}
df.loc[list_of_row_labels, list_of_column_labels]
```

For example, run the following code to see which rows are selected:

```{python mini-df-loc, exercise = TRUE, exercise.setup = "pandas-setup"}
mini_df = pd.read_csv("data/borders_inc_age.csv", usecols = ['URI', 'Main_Condition']).head(4)

mini_df_sort = mini_df.sort_values(by = 'Main_Condition')

mini_df_sort.loc[[0, 2],['Main_Condition']]
```

The first list (`[0, 2]`) instructs `.loc[]` to select the rows labelled `0` and `2` in order. The second list (`['Main_Condition']`) instructs `.loc[]` to select only the column labelled `Main_Condition`.
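A self-contained sketch of the same idea with made-up data (the real course data comes from `borders_inc_age.csv`):

```python
import pandas as pd

# Toy stand-in for mini_df: four rows labelled 0-3 by the default RangeIndex
mini_df = pd.DataFrame({'URI': [101, 102, 103, 104],
                        'Main_Condition': ['C', 'A', 'D', 'B']})

mini_df_sort = mini_df.sort_values(by='Main_Condition')

# .loc selects by row *label*: labels 0 and 2 still point at the original
# rows 'C' and 'D', even though sorting has moved them
print(mini_df_sort.loc[[0, 2], ['Main_Condition']])
```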
Member Author

I mentioned above about having some detail on this as it's relevant before wrangling. However, I think it's probably worth explaining the use of [] to target columns as well as loc, and why loc can be considered an improvement.

Comment on lines +918 to +942
#### Using `.iloc[]`

The `.iloc[]` index works using the *position* of the rows and columns, instead of their labels:

```{python iloc, eval = FALSE, echo = TRUE}
df.iloc[list_of_row_positions, list_of_column_positions]
```

where

- both the row positions and column positions start at `0`
- rows are counted top to bottom
- columns are counted left to right

For example, run the following code to see which rows are selected:

```{python mini-df-iloc, exercise = TRUE, exercise.setup = "pandas-setup"}
mini_df = pd.read_csv("data/borders_inc_age.csv", usecols = ['URI', 'Main_Condition']).head(4)

mini_df_sort = mini_df.sort_values(by = 'Main_Condition')

mini_df_sort.iloc[[0, 2],[1]]
```

Here, the list (`[0, 2]`) instructs `.iloc[]` to select only the first row (position `0`) and the third row (position `2`). The column list `[1]` instructs `.iloc[]` to select the second column, which is the column labelled `Main_Condition`.
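The contrast with `.loc` shows up in a self-contained sketch with the same made-up data (assumed values, not the real CSV):

```python
import pandas as pd

# Toy stand-in for mini_df, as before
mini_df = pd.DataFrame({'URI': [101, 102, 103, 104],
                        'Main_Condition': ['C', 'A', 'D', 'B']})

# sorted order of Main_Condition: 'A' (label 1), 'B' (label 3), 'C' (label 0), 'D' (label 2)
mini_df_sort = mini_df.sort_values(by='Main_Condition')

# .iloc selects by *position* in the sorted frame: positions 0 and 2 are 'A' and 'C'
print(mini_df_sort.iloc[[0, 2], [1]])
```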
Member Author

iloc is deprecated?

Comment on lines +944 to +1140
### Selecting Ranges of Rows

Instead of specifying lists of rows and columns with `.loc/.iloc`, we can specify a **slice**, which is like saying `select all rows from row A to row B`.

#### Slices with `.iloc`

A slice reproduces a consecutive list by specifying where to start and stop, using the syntax `start_position:stop_position`, where

- `start_position` is the first position to include
- `stop_position` is one position *beyond* the last position to include

For example, to reproduce our list `[0, 1, 2, 3, 4]`, we would use the slice `0:5` to indicate:

- start at position `0`
- stop before position `5`
- include all positions in-between (`1`, `2`, `3`, `4`)

We want to select the first 4 rows and the `HospitalCode`, `Specialty`, and `ManagementofPatient` (second to fourth) columns from `borders` after sorting the data by `Main_Condition`. We will need two slices, one for the rows and one for the columns:

- **rows**: we want to start at the first row (position `0`) and end *before* the fifth row (position `4`). The slice is `0:4` and will include positions `0`, `1`, `2` and `3`.
- **columns**: we want to start at the second column (position `1`) and end *before* the fifth column (position `4`). The slice is `1:4` and will include positions `1`, `2` and `3`.

```{python slice-iloc, exercise = TRUE, exercise.setup = "pandas-setup"}
borders = borders.sort_values(by = 'Main_Condition')
borders.iloc[0:4,1:4]
```

#### Slices with `.loc`

Unlike `.iloc`, slices using `.loc` include the label after the `:` in the slice syntax: `start_label:stop_label`, which will include

- the row/column labelled `start_label`
- the row/column labelled `stop_label`
- all rows/columns in between

Let's use `.loc` to select the same section of `borders`. This is what the data looks like after sorting by `Main_Condition`:

```{python slice-loc1, echo = FALSE}
borders = borders.sort_values(by = 'Main_Condition')
borders
```

Again we will need two slices:

- **rows**: the first four rows start at the label `5018` and end at `23180`. The slice is `5018:23180` and will include rows labelled `5018`, `6671`, `3420` and `23180`. Note how the `stop_label` is included, unlike with `.iloc`.
- **columns**: the first column to include is labelled `HospitalCode` and the last is `ManagementofPatient`. The slice is `'HospitalCode':'ManagementofPatient'` and will include columns labelled `HospitalCode`, `Specialty` and `ManagementofPatient`.

```{python slice-loc2, exercise = TRUE, exercise.setup = "pandas-setup"}
borders = borders.sort_values(by = 'Main_Condition')
borders.loc[5018:23180,'HospitalCode':'ManagementofPatient']
```

#### Open-Ended Slices

If we don't want to specify a start value, the slice will start at the first position/label. For example, the `.iloc` slice `:3` would include rows at positions `0`, `1` and `2`.

Similarly, if we don't specify an end value, the slice will end at the final position/label. For example, the `.loc` slice `'HospitalCode':` will include every column starting at `HospitalCode` through to the final column of the DataFrame.

If we specify neither the start nor the final value, the slice will include all rows and columns. For example, the code `borders.iloc[0:2,:]` will include rows at positions `0` and `1` along with every column.
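A self-contained sketch of all three slice styles, using a made-up DataFrame with non-default row labels to make the `.loc`/`.iloc` contrast visible:

```python
import pandas as pd

# Toy frame with custom row labels (10, 20, 30, 40) instead of the default 0-3
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]},
                  index=[10, 20, 30, 40])

print(df.iloc[0:2, :])         # positions 0 and 1 -> rows labelled 10 and 20 (stop excluded)
print(df.loc[10:30, 'a':'b'])  # labels 10 *through* 30 inclusive, columns a and b
print(df.iloc[:3, 1:])         # open-ended: first three rows, columns from position 1 onward
```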

### Boolean Masks

We often need to select data by properties, instead of by position/index. For example we might want to select all rows in `borders` corresponding to the HospitalCode as `B120H`. We can achieve this using Python's [comparison operators](#comparisonoperators) to compare values. Whenever we compare two variables using a comparison operator, Python returns a `True` if the comparison is correct, and a `False` if it is incorrect.

For example, here are a few comparisons and their output:

```{python boolean-mask1, echo = TRUE, eval = TRUE}
3 == 4
```

```{python boolean-mask2, echo = TRUE, eval = TRUE}
3 < 4
```

We can also use `==` to compare non-numeric variable types:

```{python boolean-mask3, echo = TRUE, eval = TRUE}
'car' == 'truck'
```

Pandas can perform comparisons for an entire column at once using the same comparisons. For example, to determine which rows of `HospitalCode` column contain `B120H`, we would use the code `df['HospitalCode'] == 'B120H'`. The output of this code is a Series with `True` for all the rows where the comparison is `True`:

```{python boolean-mask4, exercise = TRUE, exercise.setup = "pandas-setup"}
borders.head(10)['HospitalCode'] == 'B120H'
```

The output True/False Series is called a **Boolean mask**, which allows you to see only certain features of a column (e.g. whether the value is `B120H`) while obscuring others.

It can be helpful to assign a Boolean mask to a variable. For example, let's assign the Boolean mask above to the variable `is_hosp`. It's good practice, though not strictly necessary, to use parentheses to make this clearer:

```{python boolean-mask5, echo = TRUE}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H')
```

Note that two `=` signs (`==`) perform the comparison, while one `=` performs the assignment to the variable.
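A self-contained sketch of the same pattern on a toy column of hospital codes (values assumed):

```python
import pandas as pd

# Toy column standing in for borders['HospitalCode']
codes = pd.Series(['B120H', 'B114H', 'B120H'])

# `==` performs the element-wise comparison; `=` assigns the resulting mask
is_hosp = (codes == 'B120H')

print(list(is_hosp))  # [True, False, True]
```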

### Filtering Rows with Booleans

A Boolean mask tells us which rows have a certain property. What we usually want is to actually filter the DataFrame down to only those rows where the Boolean mask is `True`. We can pass the mask to the DataFrame:

```{python boolean-mask6, exercise = TRUE, exercise.setup = "pandas-setup"}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H')

borders.head(10)[is_hosp]
```
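The same filtering step as a self-contained sketch (toy data, assumed column names):

```python
import pandas as pd

# Toy stand-in for the course's borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B120H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_hosp = (df['HospitalCode'] == 'B120H')  # mask: [True, False, True]

# Passing the mask keeps only the rows where it is True (labels 0 and 2)
print(df[is_hosp])
```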

### Combining Booleans with And

We may want to filter the rows where patients are from hospital `B120H` with a certain specialty (e.g. `A1`). In this case, the two comparisons can be combined with `and` to test whether both are simultaneously true. Pandas uses the symbol `&` for this. Let's perform the row selection by combining the comparisons for the first 10 rows from `borders`:

```{python boolean-mask7, exercise = TRUE, exercise.setup = "pandas-setup"}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H') #assign the first condition to a variable

is_spec = (borders.head(10)['Specialty'] == 'A1') #assign the second condition to a variable

borders.head(10)[is_hosp & is_spec]
```

Note that if no rows match the conditions, filtering the DataFrame will produce a result with column labels in the header but no actual rows.
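A self-contained sketch, including the empty-result case (toy data, assumed values):

```python
import pandas as pd

# Toy stand-in for the borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B120H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_hosp = (df['HospitalCode'] == 'B120H')  # [True, False, True]
is_spec = (df['Specialty'] == 'A1')        # [True, True, False]

print(df[is_hosp & is_spec])  # only row 0 satisfies both conditions

# A condition nothing matches yields the header with zero rows
print(df[is_hosp & (df['Specialty'] == 'ZZ')])
```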

### Combining Booleans with Or

In Python, `or` behaves the same way: it outputs `True` as long as **one of** the combined comparisons is `True`. Pandas uses the symbol `|` instead of `or`. If we use `|` for the example above, which rows will be selected?

```{python boolean-mask8, exercise = TRUE, exercise.setup = "pandas-setup"}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H') #assign the first condition to a variable

is_spec = (borders.head(10)['Specialty'] == 'A1') #assign the second condition to a variable

borders.head(10)[is_hosp | is_spec]
```

This time, the rows where either the hospital is `B120H` or the specialty is `A1` are kept: a row is selected as long as at least one of the conditions is met.
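A self-contained sketch where `|` keeps a strict subset of the rows (toy data, assumed values):

```python
import pandas as pd

# Toy data: row 2 matches neither condition
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B102H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_hosp = (df['HospitalCode'] == 'B120H')  # [True, False, False]
is_spec = (df['Specialty'] == 'A1')        # [True, True, False]

# Rows 0 and 1 each satisfy at least one condition; row 2 satisfies neither
print(df[is_hosp | is_spec])
```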

### Inverting Booleans with Not

What if we want to select the rows where specialty is **not** `A1`? Pandas uses the symbol `~` for `not/non` to exclude the rows we don't want:

```{python boolean-mask9, exercise = TRUE, exercise.setup = "pandas-setup"}
is_spec = (borders.head(10)['Specialty'] == 'A1') #assign the condition to a variable

not_spec = ~is_spec #create a new Boolean mask to invert the pre-existing mask

borders.head(10)[not_spec]
```

You can see the rows with specialty `A1` are excluded. `~` simply swaps `True` and `False`.
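The inversion can also be sketched self-contained (toy data, assumed values):

```python
import pandas as pd

# Toy stand-in for the borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B102H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_spec = (df['Specialty'] == 'A1')  # [True, True, False]

# ~ flips every True to False and vice versa, keeping only the non-A1 rows
print(df[~is_spec])
```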

### Add/Delete a Column

To add a new column, you can use

```{python add-column, eval = FALSE, echo = TRUE}
df[new_column_name] = row_value
```

This will create a new column with all the rows populated by the `row_value`.

Let's create a new column called `number_of_eyebrows` with the value `2` for all the rows in the `borders` DataFrame.

```{python new-column, exercise = TRUE, exercise.setup = "pandas-setup"}
borders['number_of_eyebrows'] = 2

borders.head()
```

Sometimes we want to delete certain columns. We can use the `drop` method to do so.

* Removing a column by label:

```{python remove-column1, eval = FALSE, echo = TRUE}
df = df.drop(list_of_column_labels, axis = 1)
# axis = 1 specifies you are removing a column
```

Let's load the first 10 rows of `borders`, and remove the first two columns `URI` and `HospitalCode` by their column labels (column names):

```{python delete-column1, exercise = TRUE, exercise.setup = "pandas-setup"}
borders_10 = borders.head(10)

borders_10.drop(['URI', 'HospitalCode'], axis = 1)
```

* Removing a column by index position:

```{python remove-column2, eval = FALSE, echo = TRUE}
df = df.drop(df.columns[list_of_column_positions], axis = 1)
# axis = 1 specifies you are removing a column
```

Let's replicate the last example by using this method:

```{python delete-column2, exercise = TRUE, exercise.setup = "pandas-setup"}
borders_10 = borders.head(10)

borders_10.drop(borders_10.columns[[0, 1]], axis = 1)
```
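A quick self-contained check (toy data, assumed column names) that dropping by label and dropping by position are equivalent:

```python
import pandas as pd

# Toy frame with the first three column names from the borders data
df = pd.DataFrame({'URI': [1, 2], 'HospitalCode': ['a', 'b'], 'Specialty': ['A1', 'C8']})

by_label = df.drop(['URI', 'HospitalCode'], axis=1)       # drop by column name
by_position = df.drop(df.columns[[0, 1]], axis=1)         # drop by column position

print(by_label.equals(by_position))  # True: both leave only Specialty
```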
Member Author

@rmccreath Aug 25, 2025

All of this covers accessing a DataFrame, including predetermined sections of it, then a large section on working with conditionals. There are many methods that provide insight into the data, or preparation based on values, such as query(), isin(), or filter().
Why do the examples only use the first 10 rows?
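For example, a sketch of those three methods with made-up data (column names assumed from the course):

```python
import pandas as pd

# Toy stand-in for the borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B102H'],
                   'Specialty': ['A1', 'A1', 'C8']})

# query(): filter with a readable expression string
print(df.query("HospitalCode == 'B120H' and Specialty == 'A1'"))

# isin(): membership test against a list of values
print(df[df['Specialty'].isin(['A1'])])

# filter(): select columns by label pattern
print(df.filter(like='Code'))
```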


baby5.merge(baby6, how = 'left', on = ['FAMILYID','DOB'])
```

Member Author

This is a pretty long course as it is, but it would be worth including some details about where to go next, thinking about visualisations and producing outputs. While we won't have courses for these soon, it would be good to at least include details about which packages are recommended.

```

## Help & Feedback

Member Author

Help? Where should users go for learning more or knowing how to deal with bugs? What about the Data and Intelligence Forum - Python channel, what about references to further materials like PEP8?
