apply_value_formats=True produces mixed-type object columns that break to_parquet() #323

@EwoutH

Description

When reading an SPSS .sav file with apply_value_formats=True (the default when called via pandas.read_spss()), numeric variables that have value labels for only some of their codes end up as object-dtype columns containing a mix of floats (the unlabeled codes) and label strings.

For example, an SPSS variable for household size might have values 1–9 stored as doubles, with value label "10 personen of meer" for the code 10. After apply_value_formats, the resulting pandas column has dtype object with values like 1.0, 2.0, ..., 9.0, "10 personen of meer".

This is problematic because pyarrow's type inference (used by df.to_parquet(), df.to_feather(), polars, HuggingFace datasets, etc.) sees the first values, infers double, and then crashes on the string values:

ArrowInvalid: ("Could not convert '10 personen of meer' with type str:
tried to convert to double", 'Conversion failed for column HHLft1 with type object')

This affects both category-dtype columns (from formats_as_category=True) and object-dtype columns where not all values in the variable have labels (so formats_as_category doesn't apply and the column stays object with mixed content).
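
To find which columns in a given frame are affected, something like the following may help. It is only a sketch: the helper name `mixed_type_columns` and the example frame are made up for illustration, and it only inspects object-dtype columns (category columns would need their categories checked separately).

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> list[str]:
    """Return object-dtype columns whose non-missing values have more than one Python type."""
    mixed = []
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            continue  # numeric, category, etc. are not inspected here
        kinds = {type(value) for value in col.dropna()}
        if len(kinds) > 1:
            mixed.append(name)
    return mixed


# Hypothetical frame mimicking what apply_value_formats=True produces
example = pd.DataFrame({
    "HHLft1": pd.Series([1.0, 2.0, "10 personen of meer"], dtype=object),
    "age": [34.0, 51.0, 27.0],
    "city": pd.Series(["Delft", "Utrecht", None], dtype=object),
})
print(mixed_type_columns(example))  # ['HHLft1']
```

Missing values are dropped before the type check, so a pure string column with gaps is not flagged.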

Reproducer

import pyreadstat
import pandas as pd

# Any SPSS file where a numeric variable has partial value labels
# (e.g. codes 1-9 have no labels, code 10 has label "10 personen of meer")
df, meta = pyreadstat.read_sav("example.sav", apply_value_formats=True)

# Column with partial labels is now object-dtype with mixed content
print(df["HHLft1"].dtype)       # object
print(df["HHLft1"].unique())    # [1.0, 2.0, ..., '10 personen of meer']

# This crashes:
df.to_parquet("test.parquet")
# ArrowInvalid: tried to convert to double

Suggested fix

When apply_value_formats=True replaces some (or all) numeric values with their string labels, the entire column should be cast to str dtype. Once you've substituted labels for codes, the column is semantically a string column — the remaining numeric values are just codes that happened to not have a label, and keeping them as floats in an object column creates a type inconsistency.

This could be done in pyreadstat's post-processing step that applies value formats: after substitution, if any value in the column is a string, cast the entire column to str.
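
A rough sketch of that rule, applied per column. This is not pyreadstat's actual code; the function name `stringify_if_labeled` is hypothetical, and keeping missing values missing (rather than turning them into text) is my own assumption about the desired behavior.

```python
import pandas as pd


def stringify_if_labeled(col: pd.Series) -> pd.Series:
    """If an object column contains any string label, turn every non-missing value into a string."""
    if col.dtype != object:
        return col
    has_label = col.map(lambda value: isinstance(value, str)).any()
    if not has_label:
        return col
    # Leave missing values untouched so they remain missing downstream
    return col.map(lambda value: value if pd.isna(value) else str(value))


# Hypothetical column: unlabeled codes as floats, one label, one missing value
raw = pd.Series([1.0, 2.0, "10 personen of meer", None], dtype=object)
fixed = stringify_if_labeled(raw)
print(fixed.dropna().tolist())  # ['1.0', '2.0', '10 personen of meer']
```

Note that unlabeled codes come out as "1.0" rather than "1". Whether to format them as integers is a separate design choice that this sketch does not settle.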

Workaround

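# Cast every object or category column to plain strings.
# Caveat: in pandas 2.x and earlier, astype(str) also turns missing values into the string "nan".
for col in df.columns:
    if df[col].dtype == "object" or df[col].dtype.name == "category":
        df[col] = df[col].astype(str)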
