`apply_value_formats=True` produces mixed-type object columns that break `to_parquet()`

When reading an SPSS `.sav` file with `apply_value_formats=True` (the default when called via `pandas.read_spss()`), numeric variables that have value labels applied can end up as `object`-dtype columns containing a mix of unlabelled numeric values (floats) and label strings.

For example, an SPSS variable for household size might have values 1–9 stored as doubles, with value label `"10 personen of meer"` for the code 10. After `apply_value_formats`, the resulting pandas column has dtype `object` with values like `1.0, 2.0, ..., 9.0, "10 personen of meer"`.

This is problematic because pyarrow's type inference (used by `df.to_parquet()`, `df.to_feather()`, polars, HuggingFace datasets, etc.) sees the first values, infers `double`, and then crashes on the string values:

```
ArrowInvalid: ("Could not convert '10 personen of meer' with type str:
tried to convert to double", 'Conversion failed for column HHLft1 with type object')
```

This affects **both** `category`-dtype columns (from `formats_as_category=True`) **and** `object`-dtype columns where not all values in the variable have labels (so `formats_as_category` doesn't apply and the column stays `object` with mixed content).
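
To see which columns in a given frame are affected, something like the following could be used. It is a minimal sketch: `mixed_type_columns` is a hypothetical helper (not part of pyreadstat or pandas), and it assumes that "mixed" means a column, or its categories, holds at least one `str` alongside some other Python type.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> list:
    """Return names of object/category columns that mix str with other types.

    Hypothetical helper for illustration only.
    """
    mixed = []
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # For categoricals, the categories carry the value types.
            values = pd.Series(col.cat.categories, dtype=object)
        elif col.dtype == object:
            # Ignore missing values; they are not the source of the failure.
            values = col.dropna()
        else:
            continue
        kinds = {type(v) for v in values}
        if str in kinds and len(kinds) > 1:
            mixed.append(name)
    return mixed
```

Calling `mixed_type_columns(df)` on the frame returned by `read_sav` should list columns such as `HHLft1`, which narrows down where a conversion is needed.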

### Reproducer

```python
import pyreadstat
import pandas as pd

# Any SPSS file where a numeric variable has partial value labels
# (e.g. codes 1-9 have no labels, code 10 has label "10 or more")
df, meta = pyreadstat.read_sav("example.sav", apply_value_formats=True)

# Column with partial labels is now object-dtype with mixed content
print(df["HHLft1"].dtype)       # object
print(df["HHLft1"].unique())    # [1.0, 2.0, ..., '10 personen of meer']

# This crashes:
df.to_parquet("test.parquet")
# ArrowInvalid: tried to convert to double
```

### Suggested fix
When `apply_value_formats=True` replaces some (or all) numeric values with their string labels, the entire column should be cast to `str`. Once labels have been substituted for codes, the column is semantically a string column: the remaining numeric values are codes that happened to have no label, and keeping them as floats in an `object` column creates a type inconsistency.

This could be done in pyreadstat's post-processing step that applies value formats: after substitution, if any value in the column is a string, cast the entire column to `str`.
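
A rough sketch of what that post-substitution step might look like, written against plain pandas rather than pyreadstat internals. `stringify_if_labelled` is a hypothetical name, and preserving missing values instead of turning them into strings is an assumption about the desired behaviour, not something pyreadstat currently does.

```python
import pandas as pd


def stringify_if_labelled(col: pd.Series) -> pd.Series:
    """Cast a column to strings if it contains at least one string label.

    Hypothetical helper for illustration; not part of pyreadstat.
    """
    values = col.astype(object)
    has_label = values.map(lambda v: isinstance(v, str)).any()
    if not has_label:
        # Purely numeric column: leave it untouched.
        return col
    # Convert every non-missing value to str, keep missing values missing.
    return values.map(lambda v: v if pd.isna(v) else str(v))
```

Note that codes stored as doubles come out as `"1.0"` rather than `"1"` with this approach, so a real implementation might want to format integer-valued codes without the trailing `.0`.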

### Workaround
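```python
# Cast every object or category column to str so pyarrow infers a single type.
# Caveat: depending on the pandas version, astype(str) may turn missing
# values into the literal string "nan".
for col in df.columns:
    if df[col].dtype == "object" or df[col].dtype.name == "category":
        df[col] = df[col].astype(str)
```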

### Related issues
- apache/arrow#3280 and https://github.com/apache/arrow/issues/20719: pyarrow `from_pandas` fails on mixed-type object columns (open since 2018, unresolved)
- apache/arrow#15133: same issue via `to_feather()`
- pandas-dev/pandas#46863: related categorical + parquet failures
