Here are several issues I found in the rule-based entity/number extraction functions in `data_utils.py`. I'm still working on the dataset and will probably add more once I observe them. The issues are roughly sorted by impact, in my estimation.
- Implicit number words, such as `(a|an|a pair of|a trio of)` followed by nouns like `(rebound|turnover|assist|block|steal|board|three-pointer|three pointer|free-throw|free throw|dime)`, are ignored. This probably accounts for the largest number of omissions.
- Besides aliases like `{'Los Angeles', 'LA'}`, there are others, like `{'76ers', 'Sixers'}` and `{'Mavericks', 'Mavs'}`.
- The NLTK tokenizer does not separate suffixes containing `’` (which looks similar to `'` but is the Unicode character U+2019), so some entities cannot be identified.
- A player name with initials, such as `J.J.`, may also appear as `JJ`.
- A player name ending with `Jr.` may also end with `Jr`, `, Jr.`, or `, Jr`.
- A player name with a hyphen, in the form `A B-C`, may also appear as `A-B C`.
- Other minor issues, such as `Oklahoma city`.
You can search for these cases in the dataset.
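For anyone patching these locally, here is a minimal sketch of how the fixes above could look. The function names, the regexes, and the (partial) alias table are my own illustrations, not code from the original `data_utils.py`:

```python
import re

# 1. Implicit number words: "a pair of boards" -> 2, "an assist" -> 1, etc.
#    (hypothetical helper; the word lists come from the issue description)
IMPLICIT_NUM = re.compile(
    r"\b(a|an|a pair of|a trio of)\s+"
    r"(rebound|turnover|assist|block|steal|board|"
    r"three-pointer|three pointer|free-throw|free throw|dime)s?\b",
    re.IGNORECASE,
)
IMPLICIT_VALUE = {"a": 1, "an": 1, "a pair of": 2, "a trio of": 3}

def implicit_numbers(sentence):
    """Yield (value, noun) pairs for implicit number words in a sentence."""
    for m in IMPLICIT_NUM.finditer(sentence):
        yield IMPLICIT_VALUE[m.group(1).lower()], m.group(2)

# 2. Extra team aliases beyond {'Los Angeles', 'LA'} (partial table only).
TEAM_ALIASES = {"Sixers": "76ers", "Mavs": "Mavericks"}

# 3. Normalize U+2019 to the ASCII apostrophe before tokenization,
#    so the NLTK tokenizer can split possessive suffixes.
def normalize_apostrophes(text):
    return text.replace("\u2019", "'")

# 4. Generate surface variants of a player name (initials, Jr., hyphens).
def name_variants(name):
    variants = {name}
    variants.add(name.replace(".", ""))  # J.J. -> JJ
    if name.endswith(" Jr."):
        base = name[: -len(" Jr.")].rstrip()
        variants.update({base + " Jr", base + ", Jr.", base + ", Jr"})
    m = re.fullmatch(r"(\S+) (\S+)-(\S+)", name)
    if m:  # A B-C -> A-B C
        variants.add(f"{m.group(1)}-{m.group(2)} {m.group(3)}")
    return variants
```

These helpers could be called on each summary sentence before the existing extraction rules run; the alias table in particular would need to be completed from the full team list.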
I hope these findings help others in the future. I actually discovered these issues while working on (Lin et al., 2020) some years ago; they make the dataset quite noisy.
I have my own version of `data_utils.py` that fixes these issues to some extent.
Thanks for your pioneering work!