Conversation
…read functions accepting FileLike incorrectly.
|
hi @nachomaiz thanks a lot! I am a bit snowed right now, but will take a look as soon as I get some time.
@jonathon-love please check this PR, perhaps it addresses the same thing as yours? |
|
Thank you for the feedback! I've now changed I also took a quick look at @jonathon-love's PR, and I realized I was missing specific types for I switched the dataframe types for the write functions to Additionally, I've added |
|
hi @nachomaiz thanks for your efforts! I have cloned your fork, compiled it (all ok) and then run the tests. test_basic.py and test_narwhalified.py fail with 3 errors. The origin is probably that in the metadata_container class, if you look carefully, there are some members like number_rows that before defaulted to None, and now you are defining them as 0. This raises an inconsistency when using metadataonly, which also breaks read_in_chunks when reading an export file. So, could you please review those members and adapt them to be as they were before? Oh, now I see your comment. No, that is not correct: those values are not always set and therefore the None's need to be there; also do not default to datetime.now but to None.

Another one: typing_test.py raises a lot of errors. I am less familiar with mypy so I have not checked what they are about. I think you have to get a machine where you can compile pyreadstat and be able to run the tests. Please run test_basic.py and test_narwhalified.py with backend==pandas and backend==polars, and test_http_integration.py. BTW, please rename typing_test.py to test_typing.py just to keep the naming pattern.

Last one: I think this might be unnecessary if you included py.typed in the manifest. The issue with package data is that on Windows, when people install Python from the Windows app store, it installs the package and package data into different places (can't remember exactly), and I am not sure if in such case the IDE will see the py.typed (maybe yes?). I had such an issue in the past when I had to deliver dll files for Windows, and Python was not able to find them. I think this has to be tested.

Otherwise it looks good! =) Speed is also the same as before when I converted the files from pyx to py, so it seems the dataclass change is neutral in terms of performance. |
|
Hello! Thank you for reviewing and for your feedback! I'm working on setting up a machine to be able to compile and run tests, will hopefully have it soon. In the meantime, just wanted to get your thoughts on a couple of the things you mentioned above.

I have now gone through the code a bit more carefully and found the places where the … What makes it a bit complicated in my view is that if we set …

```python
if meta.number_rows is not None:
    ...
```

Which may be the easier way to handle things in the end, but also feels redundant when for many users it would never be …

On the … There are a few rows which should error; there should be 5 errors in that file (there are comments in the file pointing them out). It also analyzes other files as the test file imports them, so I'm ignoring those files for the purpose of these annotations. I noticed that there's an import error in the file so at the moment it doesn't work correctly, I'll fix that soon. I should also mention that both …

I'll fix those few bits and remove the extra … |
|
hi @nachomaiz Regarding the topic of num_rows being int or None: None signals that it was not possible to recover the information from the metadata and therefore it is undefined. It is not correct to say that happens only for POR and XPORT files; theoretically it can happen to any file type if the writing application did not write that information. For example, in the case of SPSS SAV files, some applications do not write the number of rows, so it cannot be determined and should stay as None (see for example #109). However, I am not 100% sure of what the problem is ... this is the way it has been for years and there have been no problems so far. I am also reluctant to change the interface unless it is strictly necessary. So can you please explain a bit more what your concern is? If you mean the user needs to check the possibility that num_rows is None: yes, the user should do that if they want to be strict, no way around that, for the reason explained before. Please also notice that I would like all the members that were None before to stay as they are, not only num_rows. |
|
Ok! That makes sense. My mistake for assuming things. 😅 Will bring back all the … Hopefully that gets it to a good place to merge! |
|
hi @nachomaiz thanks! Regarding the typing tests, please indicate in the comment at the top of the file, where you indicate that it has to be run with mypy, which other packages need to be installed in order to run it. We need the tests to be executable; they should have assertions, all of which should pass if everything is fine and fail if something is wrong. These tests will then be run in order to make the wheels and are expected to pass, so reveal_type is not enough. So please transform your tests into an executable and rename the file as suggested before. I have never done this, so not sure what is better; a quick search says you can use either assert_type (would be nice as no extra package needed, then you could do similarly as test_narwhalified.py) or pytest-mypy-plugins (would require installing extra stuff, but apparently you can write negative tests more easily). |
nachomaiz
left a comment
Hi @ofajardo,
Thank you!
I've reworked the numpy type annotations for 3.11+ and have successfully run all the tests.
I've also fixed the duplicated test and replaced it with the correct xport test.
It seems like last week pytest-mypy-plugins updated, and with the new version the behavior of the --mypy-only-local-stub flag is now the default, so it's not needed anymore if you update the package to 4.0. I've reflected that in the test file as well.
Let me know if you have any other comments! Cheers!
|
Hello again,

While I was updating the Python version of my venv for 3.11+, I found it a bit tricky to make sure I had all test dependencies. There's been an option to define group dependencies for some time, and it might be helpful to bring that into this project's … Something like:

```toml
[dependency-groups]
dev = ["pandas[pyarrow]>=2.0.0", "polars>=1.30.0"]
test = ["pytest>9.0.0", "pytest-mypy-plugins>=4.0.0"]
```

Then, the environment setup command for testing or development could look like (requires …):

```shell
pip install --group dev --group test
```

Some tools like … Given this PR adds a new test dependency (… |
|
hi @nachomaiz thanks for the improvements!
|
|
I am also asking Claude Code what it thinks about the tests. It says several things, most of which I consider too picky, but a couple caught my eye, so I write them here for your consideration: … I also asked about the type hints in pyreadstat.py and a couple that I was intrigued about: particularly number 4 ... is FileLike not going to work with worker.py? If that's true, that is something to correct. Number 7 is a good one, not related to the typing, so I can also fix it later, or you could fix it if you feel like it ... Number 5 makes sense ... what do you think about that one? It may become float, but may also be int most of the time. |
|
Oh very cool! Thanks for checking and running this through Claude, and sorry for the back and forth. There must be something that changes enough in each combination of pandas/mypy/numpy typing that breaks some of these tests, and I agree that they are a bit flaky due to that, because they pass when I run them, which is a bit concerning. I'll need to read through the rest properly & implement changes, though it looks like most are simple fixes due to copy-pasting mistakes (apologies).

On the "inf" row limit question, I'd also be surprised to get a float row limit, because rows can only be whole numbers. But that is logically impossible, since all possible integer values must be smaller than "inf" when the function goes through min-max clamping. I think it's fine to have type hints only at the API boundaries; as long as the logic is consistent it doesn't need to be fully typed within the function. So I'd recommend ignoring that. But if we want to be type correct there, I'd suggest setting the default in the …

On the file-like question for … And I'll read through the rest & implement changes soon as well. |
|
hi, Yes, I am also a bit concerned about the maintainability of the tests if they sometimes pass and sometimes not. Let's try to find something that is more robust. If you find the reason for the problem and the solution, please explain what was happening.

Regarding the row limit thing, let's leave it as is. Regarding the other Claude comments, check them, but it is an LLM, so please share your thoughts on why you would change or not change something.

I have one more question: in pyreadstat.py lines 34 to 44 (if TYPE_CHECKING), here you try to import pandas and polars, and if the import fails you pass, which is cool, because the user may not have both of them installed. But on line 44 you then define the Dataframe type alias as either the polars or the pandas dataframe. If one of them failed to get imported, will this just default to the one that is present? Have you tested what happens if you have only one of them installed?

Thanks again for your great work, I think we are very close to finishing! |
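For context, the import-under-TYPE_CHECKING pattern being discussed looks roughly like this; a sketch of the idea, not the actual pyreadstat source:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # This branch is only analyzed by type checkers and never executes at
    # runtime, so a user missing one of the packages gets no ImportError.
    try:
        import pandas
    except ImportError:
        pass
    try:
        import polars
    except ImportError:
        pass
    DataFrame = pandas.DataFrame | polars.DataFrame

# At runtime TYPE_CHECKING is False, so none of the guarded code ran:
assert not TYPE_CHECKING
```

Note that at runtime this is inert regardless of which packages are installed; the open question above is only about how type checkers resolve the alias when one import fails.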
|
While I was fixing some of the issues above, I started testing a few things with the numpy output formats at runtime and I noticed something:

```python
>>> data, meta = pyreadstat.read_sav("test_data/basic/example.sav", output_format="dict")
>>> type(data["mychar"])
list  # Expected `numpy.ndarray` based on documentation
```

I started looking for where the creation of the column data container happens, and I found it here in … So it seems like the data is never stored as numpy arrays when … This actually would simplify this typing issue with numpy; we can set the output type to …

I'm not sure if there's a reason to create numpy arrays for pandas specifically. I'd expect lists would work too, and the data dictionary is not visible to the user, but in any case I'd imagine the docstrings and type hints should reflect this behavior with … Before marking the outputs as … |
nachomaiz
left a comment
Actually, sorry one correction, read_file_multiprocessing actually casts the column data to numpy:
This seems a bit inconsistent overall... I'd suggest picking one format and sticking with it throughout the API, so I'll wait for your thoughts before making any changes.
|
hi @nachomaiz. Thanks! You are right, and that was a very good catch! What happens is that for pandas I need to build a dict of numpy arrays, because when I tried with lists instead, it became super slow, and in addition you run out of memory if the lists are big. For polars, on the other hand, it has to be lists, because it is faster than numpy arrays and the types are interpreted correctly if there are missing values (polars gets confused with np.nan). Now, the dict output is historically something in between: users were asking for polars support and the low hanging fruit was to return the dict of numpy arrays I was building anyway for pandas, for them to do polars.from_dict. Later, I realized that polars did not like the numpy arrays and I changed it to lists. So what you see in the multiprocessing is a leftover from the time dict was containing numpy arrays that I forgot to clean. So now, in the branch pyfile_dev, I corrected that issue and now the values of a dict should always be lists. Please merge that into your fork, and change the types to dict[str, list[Any]] as you suggested. |
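The resulting contract for the dict output can be sketched like this, using illustrative values rather than data read from a real file:

```python
from typing import Any

# After the fix, dict output should always map column names to plain
# Python lists (never numpy arrays), matching the suggested annotation.
DictOutput = dict[str, list[Any]]

data: DictOutput = {"mychar": ["a", "b"], "mynum": [1.0, None]}
assert all(isinstance(col, list) for col in data.values())
```

Keeping the alias in one place means the read functions, the multiprocessing path, and the docstrings can all point at the same definition.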
|
Hi @ofajardo! Thanks for making the change! I've merged it into the PR and I'm making some of the changes we discussed. I'll finish up soon with a bit more of a write up with the changes and answering your questions about design choices. |
|
Hi again! Ok, so I've updated the PR with a few changes, I'll try to list them and explain the reasoning behind them, so please excuse the long text. 😅
There are still some typing issues and inconsistencies that are unfortunately not supported, mostly to do with function signatures and overloads. I know of a few, but there probably are more edge cases out there that I'm not aware of:
Hopefully this clarifies the changes and thinking behind them. Let me know if you have any other lingering questions, or if you prefer I revert any of the new changes. Thanks for your patience and help! Hope this gets it close to merging! |
|
hi @nachomaiz thanks a lot for your detailed and hard work! I think we are going to be there soon! I am reviewing and so far everything looks excellent. But I got a few errors when running the typing tests; at least the first one seems to be something related to the mypy version. I am using 1.20.0, maybe you are using an older one? If so I would recommend updating things to the newest version. I haven't checked the other errors; many seem to be the same thing, but there are a couple that may be something else, could you please check? |
|
Huh, interesting! So yes, I was using 1.19.1 up until now. It looks like 1.20.0 was released last week, and looking at the release notes it's not very apparent what changed that would have made those fail now... But yeah, it seems like their rewrite of the type cache format they talk about in the notes has changed the behavior so that … I've fixed those issues to work with the new version, and bumped the version of … |
|
hey @nachomaiz, I merged! thanks again for all the hard work so far! What happens now is: I am going to prepare everything to send this branch to the CI/CD pipeline. If everything passes there, I will upload to PyPI test, I will let you know so you can test the package, and then finally I do a release. I also asked Claude to write a script to test the types at run time, I will let you know once I put it up for you to take a look. |
|
Amazing, thank you very much! I'll keep the branch open until then and delete it once everything looks good. I learned a lot about typing, cython and this library so I'm very grateful for all your help and patience. Will check on test PyPI once you let me know! |
|
hi @nachomaiz big success! the CI/CD pipeline worked well! Maybe, could you manually download the wheel from here and try it? It is a bit easier than uploading to PyPI .... The new test file is here in case you would like to take a look. The only side effect of all of this is that pyreadstat is not working anymore for Python 3.10 (see here). As I said before, I am ok with that. However, I was expecting that now that we are not using numpy types, it would work for 3.10, but actually it fails at importing pyreadstat: … As I am not familiar with all this typing stuff yet, I asked Claude what the cause is and what the solutions could be: … Option A looked sort of OK, so I asked what the implications of doing that are: … So, I like backwards compatibility, but the solutions look a bit odd to me, so I have not implemented anything. Do you have any thoughts on this? |
|
Hmm I see... Yeah, I didn't realize that would make it not work with 3.10. The original code with the … The reason I wanted to use … So we could revert that, which I think for consumers of the library would look pretty much the same as with the ellipsis. If you'd rather avoid creating more PRs/branches, you could implement that in your branch by importing …

```python
_P = ParamSpec("_P")
PyreadstatReadFunction: TypeAlias = Callable[
    Concatenate[FilePathorBuffer, _P], "tuple[DataFrame | DictOutput, metadata_container]"
]
```

Alternatively, I would actually suggest that, instead of moving the declaration inside … It would look like this:

```python
PyreadstatReadFunction: TypeAlias = "Callable[Concatenate[FilePathorBuffer, ...], tuple[DataFrame | DictOutput, metadata_container]]"
```

I believe most type checkers tend to backport these semantic changes to previous versions, even if they would fail at runtime, and expect them to be declared inside string quotes, so it should work unless I am mistaken. Might need to see if the runtime type checks and the typing checks are happy with that. |
Hi @ofajardo!
This PR aims to fix #299, adding type annotations to all public interface functions and classes.
I based them on the docstrings and how I understand the code is operating with the different parameters and class attributes, but I might have missed something.
I wasn't able to compile the library on this machine, however I have done a runtime check of the type annotations to make sure everything runs in py3.10+.
How it works:
- `TypedDict` classes for missing ranges and MR sets.
- `FileLike` protocol with the methods `read` and `seek`.
- `os.PathLike` for flexibility with `os.fsencode`.
- Write functions accept any dataframe object supported by `narwhals`: either a `pandas.DataFrame` or a `polars.DataFrame` as the first argument.
- `PyreadstatReadFunction` callable type: its first argument must be a path/file-like object and it must return a tuple of data and metadata.
- `narwhals` type vars to signal the return of the same type of dataframe.

While I added type annotations, I saw a few issues with the docstrings. I took the liberty to sync them up with the type annotations.
I also used a formatter for the function signatures as they were getting unwieldy. This changed the formatting of some of the code within the functions, so let me know if you would prefer I revert those.
Looking forward to your feedback.
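As an illustration of the `FileLike` protocol mentioned above, a minimal sketch of its assumed shape (only `read` and `seek` required) could look like this; the exact signatures here are my assumption, not copied from the PR:

```python
import io
from typing import Protocol, runtime_checkable

@runtime_checkable
class FileLike(Protocol):
    # Assumed shape: any object exposing read and seek qualifies,
    # so io.BytesIO, open file handles, etc. are all accepted structurally.
    def read(self, size: int = ..., /) -> bytes: ...
    def seek(self, offset: int, whence: int = ..., /) -> int: ...

buf = io.BytesIO(b"\x00\x01")
assert isinstance(buf, FileLike)  # runtime_checkable only checks method presence
```

A structural protocol like this lets the read functions accept file handles and in-memory buffers without depending on a concrete class hierarchy.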