
Review of whole repo#4

Open
rmccreath wants to merge 21 commits into review/base from review/changes

Conversation

@rmccreath
Member

TBC



To open a new Jupyter Notebook, click the "New" drop down menu on the Files tab and select “Python 3.10.2” under the notebooks heading. This will open a blank Notebook with an IPython console running underneath it.
Member Author

This version of Python isn't available on the server anymore. It might be better to include details about Python versions in the 'Foundation' section, then in the IDE instructions, give details about what should be selected at a higher level rather than very specific (helps users know why, avoids the contents being outdated at the next update).

Comment on lines +815 to +841
### Row Indexes

By default, DataFrames come with a **RangeIndex** where the first row is labelled `0`, the second row is labelled `1`, and so on. We can call the `.index` attribute on a DataFrame:

```{python call-index, eval = FALSE, echo = TRUE}
df.index
```

If the DataFrame has the default index, the output will be

```{python index-output, eval = FALSE, echo = TRUE}
RangeIndex(start=0, stop=number of rows, step=1)
```

indicating that the row labels

- start at 0
- increase by 1 for each row
- end *before* reaching the total number of rows (since we started counting at 0)

For example, a DataFrame with 3 rows might have the index `RangeIndex(start=0, stop=3, step=1)`.

In this RangeIndex, we start labelling rows at 0, stop before 3, and increase by 1 each row:

- the first row has the label 0
- the second row has the label 1
- the third row has the label 2
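As a self-contained sketch (a made-up 3-row DataFrame, not the course data):

```python
import pandas as pd

# Hypothetical 3-row DataFrame; any 3-row frame gets the same default index
df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cat'], 'age': [34, 29, 41]})

print(df.index)        # RangeIndex(start=0, stop=3, step=1)
print(list(df.index))  # [0, 1, 2]
```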
Member Author

The RangeIndex is a component of a Pandas object; a DataFrame will use it by default, accessed via DataFrame.index. This feels like a lot of detail to describe indexing on an introduction course, I'm not sure there's a benefit?

Contributor

I think indexing concept is used for iloc and loc sections later on. I'll leave this section here for now and we can discuss later.

Comment on lines +866 to +870
```{python sort-value2, exercise = TRUE, exercise.setup = "pandas-setup"}
borders.sort_values(
by = 'HospitalCode'
)
```
Member Author

The inplace default is False so this won't update the data? It's probably important to mention that.
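For instance, a minimal sketch with made-up data (not the course's `borders` CSV):

```python
import pandas as pd

# Toy stand-in for the borders data (values assumed)
df = pd.DataFrame({'HospitalCode': ['B', 'A', 'C']})

df.sort_values(by='HospitalCode')       # returns a sorted *copy*; df is unchanged
print(list(df['HospitalCode']))         # ['B', 'A', 'C']

df = df.sort_values(by='HospitalCode')  # reassigning keeps the sorted result
print(list(df['HospitalCode']))         # ['A', 'B', 'C']
```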


Sometimes, we are only interested in looking at certain rows and columns of a DataFrame. We can select sections of a DataFrame using two methods: `.loc[]` and `.iloc[]`.

We will explore how both of these work using a mini-DataFrame from `borders` data.
Member Author

Why?

mini_df.sort_values(by = 'Main_Condition')
```

Note that the row index is now out-of-order. This will be important later on.
Member Author

Typo: ?


print(hello_world)
```

Member Author

As things are quite different in Python, understanding how to interact with dataframes, like accessing specific columns, would be good to know. This is missing at the moment, but could be included here or in the 'Explore' section.

Comment on lines +898 to +916
#### Using `.loc[]`

The `.loc[]` index works using the index labels:

```{python loc, eval = FALSE, echo = TRUE}
df.loc[list_of_row_labels, list_of_column_labels]
```

For example, run the following code to see which rows are selected:

```{python mini-df-loc, exercise = TRUE, exercise.setup = "pandas-setup"}
mini_df = pd.read_csv("data/borders_inc_age.csv", usecols = ['URI', 'Main_Condition']).head(4)

mini_df_sort = mini_df.sort_values(by = 'Main_Condition')

mini_df_sort.loc[[0, 2],['Main_Condition']]
```

The first list (`[0, 2]`) instructs `.loc[]` to select the rows labelled `0` and `2` in order. The second list (`['Main_Condition']`) instructs `.loc[]` to select only the column labelled `Main_Condition`.
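A self-contained sketch of the same idea with made-up data (the real course data comes from `borders_inc_age.csv`):

```python
import pandas as pd

# Toy stand-in for mini_df: four rows labelled 0-3 by the default RangeIndex
mini_df = pd.DataFrame({'URI': [101, 102, 103, 104],
                        'Main_Condition': ['C', 'A', 'D', 'B']})

mini_df_sort = mini_df.sort_values(by='Main_Condition')

# .loc selects by row *label*: labels 0 and 2 still point at the original
# rows 'C' and 'D', even though sorting has moved them
print(mini_df_sort.loc[[0, 2], ['Main_Condition']])
```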
Member Author

I mentioned above about having some detail on this as it's relevant before wrangling. However, I think it's probably worth explaining the use of [] to target columns as well as loc, and why loc can be considered an improvement.

Comment on lines +918 to +942
#### Using `.iloc[]`

The `.iloc[]` index works using the *position* of the rows and columns, instead of their labels:

```{python iloc, eval = FALSE, echo = TRUE}
df.iloc[list_of_row_positions, list_of_column_positions]
```

where

- both the row positions and column positions start at `0`
- rows are counted top to bottom
- columns are counted left to right

For example, run the following code to see which rows are selected:

```{python mini-df-iloc, exercise = TRUE, exercise.setup = "pandas-setup"}
mini_df = pd.read_csv("data/borders_inc_age.csv", usecols = ['URI', 'Main_Condition']).head(4)

mini_df_sort = mini_df.sort_values(by = 'Main_Condition')

mini_df_sort.iloc[[0, 2],[1]]
```

Here, the list (`[0, 2]`) instructs `.iloc[]` to select only the first row (position `0`) and the third row (position `2`). The column list `[1]` instructs `.iloc[]` to select the second column, which is the column labelled `Main_Condition`.
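The contrast with `.loc` shows up in a self-contained sketch with the same made-up data (assumed values, not the real CSV):

```python
import pandas as pd

# Toy stand-in for mini_df, as before
mini_df = pd.DataFrame({'URI': [101, 102, 103, 104],
                        'Main_Condition': ['C', 'A', 'D', 'B']})

# sorted order of Main_Condition: 'A' (label 1), 'B' (label 3), 'C' (label 0), 'D' (label 2)
mini_df_sort = mini_df.sort_values(by='Main_Condition')

# .iloc selects by *position* in the sorted frame: positions 0 and 2 are 'A' and 'C'
print(mini_df_sort.iloc[[0, 2], [1]])
```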
Member Author

iloc is deprecated?

Comment on lines +944 to +1140
### Selecting Ranges of Rows

Instead of specifying lists of rows and columns with `.loc/.iloc`, we can specify a **slice**, which is like saying `select all rows from row A to row B`.

#### Slices with `.iloc`

A slice reproduces a consecutive list by specifying where to start and stop, using the syntax `start_position:stop_position`, where

- `start_position` is the first position to include
- `stop_position` is one position *beyond* the last position to include

For example, to reproduce our list `[0, 1, 2, 3, 4]`, we would use the slice `0:5` to indicate:

- start at position `0`
- stop before position `5`
- include all positions in-between (`1`, `2`, `3`, `4`)

We want to select the first 4 rows and the `HospitalCode`, `Specialty`, and `ManagementofPatient` (second to fourth) columns from `borders` after sorting the data by `Main_Condition`. We will need two slices, one for the rows and one for the columns:

- **rows**: we want to start at the first row (position `0`) and end *before* the fifth row (position `4`). The slice is `0:4` and will include positions `0`, `1`, `2` and `3`.
- **columns**: we want to start at the second column (position `1`) and end *before* the fifth column (position `4`). The slice is `1:4` and will include positions `1`, `2` and `3`.

```{python slice-iloc, exercise = TRUE, exercise.setup = "pandas-setup"}
borders = borders.sort_values(by = 'Main_Condition')
borders.iloc[0:4,1:4]
```

#### Slices with `.loc`

Unlike `.iloc`, slices using `.loc` include the label after the `:` in the slice syntax: `start_label:stop_label`, which will include

- the row/column labelled `start_label`
- the row/column labelled `stop_label`
- all rows/columns in between

Let's use `.loc` to select the same section of `borders`. This is what the data looks like after sorting by `Main_Condition`:

```{python slice-loc1, echo = FALSE}
borders = borders.sort_values(by = 'Main_Condition')
borders
```

Again we will need two slices:

- **rows**: the first four rows start at the label `5018` and end at `23180`. The slice is `5018:23180` and will include rows labelled `5018`, `6671`, `3420` and `23180`. Note how the `stop_label` is included, unlike with `.iloc`.
- **columns**: the first column to include is labelled `HospitalCode` and the last is `ManagementofPatient`. The slice is `'HospitalCode':'ManagementofPatient'` and will include columns labelled `HospitalCode`, `Specialty` and `ManagementofPatient`.

```{python slice-loc2, exercise = TRUE, exercise.setup = "pandas-setup"}
borders = borders.sort_values(by = 'Main_Condition')
borders.loc[5018:23180,'HospitalCode':'ManagementofPatient']
```

#### Open-Ended Slices

If we don't want to specify a start value, the slice will start at the first position/label. For example, the `.iloc` slice `:3` would include rows at positions `0`, `1` and `2`.

Similarly, if we don't specify an end value, the slice will end at the final position/label. For example, the `.loc` slice `'HospitalCode':` will include every column starting at `HospitalCode` through to the final column of the DataFrame.

If we specify neither the start nor the final value, the slice will include all rows and columns. For example, the code `borders.iloc[0:2,:]` will include rows at positions `0` and `1` along with every column.
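A self-contained sketch of all three slice styles, using a made-up DataFrame with non-default row labels to make the `.loc`/`.iloc` contrast visible:

```python
import pandas as pd

# Toy frame with custom row labels (10, 20, 30, 40) instead of the default 0-3
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]},
                  index=[10, 20, 30, 40])

print(df.iloc[0:2, :])         # positions 0 and 1 -> rows labelled 10 and 20 (stop excluded)
print(df.loc[10:30, 'a':'b'])  # labels 10 *through* 30 inclusive, columns a and b
print(df.iloc[:3, 1:])         # open-ended: first three rows, columns from position 1 onward
```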

### Boolean Masks

We often need to select data by properties, instead of by position/index. For example we might want to select all rows in `borders` corresponding to the HospitalCode as `B120H`. We can achieve this using Python's [comparison operators](#comparisonoperators) to compare values. Whenever we compare two variables using a comparison operator, Python returns a `True` if the comparison is correct, and a `False` if it is incorrect.

For example, here are a few comparisons and their output:

```{python boolean-mask1, echo = TRUE, eval = TRUE}
3 == 4
```

```{python boolean-mask2, echo = TRUE, eval = TRUE}
3 < 4
```

We can also use `==` to compare non-numeric variable types:

```{python boolean-mask3, echo = TRUE, eval = TRUE}
'car' == 'truck'
```

Pandas can perform comparisons for an entire column at once using the same comparisons. For example, to determine which rows of `HospitalCode` column contain `B120H`, we would use the code `df['HospitalCode'] == 'B120H'`. The output of this code is a Series with `True` for all the rows where the comparison is `True`:

```{python boolean-mask4, exercise = TRUE, exercise.setup = "pandas-setup"}
borders.head(10)['HospitalCode'] == 'B120H'
```

The output True/False Series is called a **Boolean mask**, which allows you to see only certain features of a column (e.g. whether the value is `B120H`) while obscuring others.

It can be helpful to assign a Boolean mask to a variable. For example, let's assign the Boolean mask above to the variable `is_hosp`. It's good practice, though not strictly necessary, to use parentheses to make this clearer:

```{python boolean-mask5, echo = TRUE}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H')
```

Note that two `=` signs (`==`) perform the comparison, while one `=` performs the assignment to the variable.
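A self-contained sketch of the same pattern on a toy column of hospital codes (values assumed):

```python
import pandas as pd

# Toy column standing in for borders['HospitalCode']
codes = pd.Series(['B120H', 'B114H', 'B120H'])

# `==` performs the element-wise comparison; `=` assigns the resulting mask
is_hosp = (codes == 'B120H')

print(list(is_hosp))  # [True, False, True]
```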

### Filtering Rows with Booleans

A Boolean mask tells us which rows have a certain property. What we usually want is to actually filter the DataFrame down to only those rows where the Boolean mask is `True`. We can pass the mask to the DataFrame:

```{python boolean-mask6, exercise = TRUE, exercise.setup = "pandas-setup"}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H')

borders.head(10)[is_hosp]
```
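The same filtering step as a self-contained sketch (toy data, assumed column names):

```python
import pandas as pd

# Toy stand-in for the course's borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B120H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_hosp = (df['HospitalCode'] == 'B120H')  # mask: [True, False, True]

# Passing the mask keeps only the rows where it is True (labels 0 and 2)
print(df[is_hosp])
```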

### Combining Booleans with And

We may want to filter the rows where patients are from hospital `B120H` with a certain specialty (e.g. `A1`). In this case, the two comparisons can be combined with `and` to test whether both are simultaneously true. Pandas uses the symbol `&` for this. Let's perform the row selection by combining the comparisons for the first 10 rows from `borders`:

```{python boolean-mask7, exercise = TRUE, exercise.setup = "pandas-setup"}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H') #assign the first condition to a variable

is_spec = (borders.head(10)['Specialty'] == 'A1') #assign the second condition to a variable

borders.head(10)[is_hosp & is_spec]
```

Note that if no rows match the conditions, filtering the DataFrame will produce a result with column labels in the header but no actual rows.
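A self-contained sketch, including the empty-result case (toy data, assumed values):

```python
import pandas as pd

# Toy stand-in for the borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B120H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_hosp = (df['HospitalCode'] == 'B120H')  # [True, False, True]
is_spec = (df['Specialty'] == 'A1')        # [True, True, False]

print(df[is_hosp & is_spec])  # only row 0 satisfies both conditions

# A condition nothing matches yields the header with zero rows
print(df[is_hosp & (df['Specialty'] == 'ZZ')])
```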

### Combining Booleans with Or

In Python, `or` behaves the same way: it outputs `True` as long as **one of** the combined comparisons is `True`. Pandas uses the symbol `|` instead of `or`. If we use `|` for the example above, which rows will be selected?

```{python boolean-mask8, exercise = TRUE, exercise.setup = "pandas-setup"}
is_hosp = (borders.head(10)['HospitalCode'] == 'B120H') #assign the first condition to a variable

is_spec = (borders.head(10)['Specialty'] == 'A1') #assign the second condition to a variable

borders.head(10)[is_hosp | is_spec]
```

This time, the rows where either the hospital is `B120H` or the specialty is `A1` are kept: a row is selected as long as at least one of the conditions is met.
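A self-contained sketch where `|` keeps a strict subset of the rows (toy data, assumed values):

```python
import pandas as pd

# Toy data: row 2 matches neither condition
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B102H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_hosp = (df['HospitalCode'] == 'B120H')  # [True, False, False]
is_spec = (df['Specialty'] == 'A1')        # [True, True, False]

# Rows 0 and 1 each satisfy at least one condition; row 2 satisfies neither
print(df[is_hosp | is_spec])
```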

### Inverting Booleans with Not

What if we want to select the rows where specialty is **not** `A1`? Pandas uses the symbol `~` for `not/non` to exclude the rows we don't want:

```{python boolean-mask9, exercise = TRUE, exercise.setup = "pandas-setup"}
is_spec = (borders.head(10)['Specialty'] == 'A1') #assign the condition to a variable

not_spec = ~is_spec #create a new Boolean mask to invert the pre-existing mask

borders.head(10)[not_spec]
```

You can see the rows with specialty `A1` are excluded. `~` simply swaps `True` and `False`.
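The inversion can also be sketched self-contained (toy data, assumed values):

```python
import pandas as pd

# Toy stand-in for the borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B102H'],
                   'Specialty': ['A1', 'A1', 'C8']})

is_spec = (df['Specialty'] == 'A1')  # [True, True, False]

# ~ flips every True to False and vice versa, keeping only the non-A1 rows
print(df[~is_spec])
```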

### Add/Delete a Column

To add a new column, you can use

```{python add-column, eval = FALSE, echo = TRUE}
df[new_column_name] = row_value
```

This will create a new column with all the rows populated by the `row_value`.

Let's create a new column called `number_of_eyebrows` with the value `2` for all the rows in the `borders` DataFrame.

```{python new-column, exercise = TRUE, exercise.setup = "pandas-setup"}
borders['number_of_eyebrows'] = 2

borders.head()
```

Sometimes we want to delete certain columns. We can use the `drop` method to do so.

* Removing a column by label:

```{python remove-column1, eval = FALSE, echo = TRUE}
df = df.drop(list_of_column_labels, axis = 1)
# axis = 1 specifies you are removing a column
```

Let's load the first 10 rows of `borders`, and remove the first two columns `URI` and `HospitalCode` by their column labels (column names):

```{python delete-column1, exercise = TRUE, exercise.setup = "pandas-setup"}
borders_10 = borders.head(10)

borders_10.drop(['URI', 'HospitalCode'], axis = 1)
```

* Removing a column by index position:

```{python remove-column2, eval = FALSE, echo = TRUE}
df = df.drop(df.columns[list_of_column_positions], axis = 1)
# axis = 1 specifies you are removing a column
```

Let's replicate the last example by using this method:

```{python delete-column2, exercise = TRUE, exercise.setup = "pandas-setup"}
borders_10 = borders.head(10)

borders_10.drop(borders_10.columns[[0, 1]], axis = 1)
```
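A quick self-contained check (toy data, assumed column names) that dropping by label and dropping by position are equivalent:

```python
import pandas as pd

# Toy frame with the first three column names from the borders data
df = pd.DataFrame({'URI': [1, 2], 'HospitalCode': ['a', 'b'], 'Specialty': ['A1', 'C8']})

by_label = df.drop(['URI', 'HospitalCode'], axis=1)       # drop by column name
by_position = df.drop(df.columns[[0, 1]], axis=1)         # drop by column position

print(by_label.equals(by_position))  # True: both leave only Specialty
```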
Member Author

@rmccreath Aug 25, 2025

All of this covers accessing a DataFrame, including predetermined sections of it, then a large section on working with conditionals. There are many methods that provide insight into the data, or preparation based on values, such as query(), isin(), or filter().
Why do the examples only use the first 10 rows?
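For example, a sketch of those three methods with made-up data (column names assumed from the course):

```python
import pandas as pd

# Toy stand-in for the borders data
df = pd.DataFrame({'HospitalCode': ['B120H', 'B114H', 'B102H'],
                   'Specialty': ['A1', 'A1', 'C8']})

# query(): filter with a readable expression string
print(df.query("HospitalCode == 'B120H' and Specialty == 'A1'"))

# isin(): membership test against a list of values
print(df[df['Specialty'].isin(['A1'])])

# filter(): select columns by label pattern
print(df.filter(like='Code'))
```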


baby5.merge(baby6, how = 'left', on = ['FAMILYID','DOB'])
```

Member Author

This is a pretty long course as it is, but it would be worth including some details about where to go next, thinking about visualisations and producing outputs. While we won't have courses for these soon, it would be good to at least include details about which packages are recommended.

```

## Help & Feedback

Member Author

Help? Where should users go for learning more or knowing how to deal with bugs? What about the Data and Intelligence Forum - Python channel, what about references to further materials like PEP8?
