When reading an SPSS .sav file with apply_value_formats=True (the default when called via pandas.read_spss()), numeric variables whose value labels cover only some of their codes end up as object-dtype columns containing a mix of floats (the unlabelled codes) and label strings.
For example, an SPSS variable for household size might have values 1–9 stored as doubles with no labels, plus the value label "10 personen of meer" (Dutch for "10 persons or more") for the code 10. After the value formats are applied, the resulting pandas column has dtype object with values like 1.0, 2.0, ..., 9.0, "10 personen of meer".
This is problematic because pyarrow's type inference (used by df.to_parquet(), df.to_feather(), polars, HuggingFace datasets, etc.) sees the first values, infers double, and then crashes on the string values:
ArrowInvalid: ("Could not convert '10 personen of meer' with type str:
tried to convert to double", 'Conversion failed for column HHLft1 with type object')
This affects both category-dtype columns (from formats_as_category=True) and object-dtype columns where not all values in the variable have labels (so formats_as_category doesn't apply and the column stays object with mixed content).
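To find out which columns in an already-loaded frame are affected, something like the following can help. It is only a sketch: mixed_type_columns is a hypothetical helper, not part of pyreadstat or pandas, and the DataFrame is a synthetic stand-in for the output of read_sav.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Hypothetical helper: map column name -> set of Python type names found."""
    report = {}
    for name in df.columns:
        col = df[name]
        # Only object and category columns can hold mixed Python types.
        if col.dtype != object and col.dtype.name != "category":
            continue
        # Missing values are ignored; they are not the source of the failure.
        kinds = {type(v).__name__ for v in col.astype(object).dropna()}
        if len(kinds) > 1:
            report[name] = kinds
    return report


# Synthetic data mimicking a partially labelled variable next to a plain one.
df = pd.DataFrame({
    "HHLft1": pd.Series([1.0, 2.0, "10 personen of meer"], dtype=object),
    "age": [34.0, 51.0, 27.0],
})
print(mixed_type_columns(df))
```

Any column reported here holds more than one Python type and is a candidate for the Arrow conversion error above.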
Reproducer
import pyreadstat
import pandas as pd
# Any SPSS file where a numeric variable has partial value labels
# (e.g. codes 1-9 have no labels, code 10 has label "10 or more")
df, meta = pyreadstat.read_sav("example.sav", apply_value_formats=True)
# Column with partial labels is now object-dtype with mixed content
print(df["HHLft1"].dtype) # object
print(df["HHLft1"].unique()) # [1.0, 2.0, ..., '10 personen of meer']
# This crashes:
df.to_parquet("test.parquet")
# ArrowInvalid: tried to convert to double
Suggested fix
When apply_value_formats=True replaces some (or all) numeric values with their string labels, the entire column should be cast to str dtype. Once you've substituted labels for codes, the column is semantically a string column — the remaining numeric values are just codes that happened to not have a label, and keeping them as floats in an object column creates a type inconsistency.
This could be done in pyreadstat's post-processing step that applies value formats: after substitution, if any value in the column is a string, cast the entire column to str.
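As a rough illustration of that rule (not pyreadstat's actual code), the per-column logic could look like this. stringify_if_labeled is a hypothetical name, and leaving missing values untouched instead of turning them into the text "nan" is an extra assumption on my part.

```python
import pandas as pd


def stringify_if_labeled(col: pd.Series) -> pd.Series:
    """Hypothetical helper: cast a column to strings if any value is a str.

    Missing values are left as they are rather than becoming the text "nan".
    """
    values = col.astype(object)  # also unwraps category columns
    if not values.map(lambda v: isinstance(v, str)).any():
        return col  # no labels were substituted: leave the column alone
    return values.map(lambda v: v if pd.isna(v) else str(v))


# Synthetic stand-in for a partially labelled SPSS variable.
household = pd.Series([1.0, 2.0, None, "10 personen of meer"], dtype=object)
fixed = stringify_if_labeled(household)
print(fixed.tolist())
```

Columns without any substituted label are returned unchanged, so purely numeric variables keep their float dtype.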
Workaround
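# Cast every object or category column to str so pyarrow sees a single type.
# Note: astype(str) also turns missing values (NaN/None) into the strings "nan"/"None".
for col in df.columns:
    if df[col].dtype == "object" or df[col].dtype.name == "category":
        df[col] = df[col].astype(str)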
Related issues
from_pandas fails on mixed-type object columns (open since 2018, unresolved)
to_feather() / to_parquet failing for integer-like string values in categorical column (pandas-dev/pandas#46863): related categorical + parquet failures