Here are several issues I found in the rule-based entity/number extraction functions in `data_utils.py`. I'm still working on the dataset and will probably add more once I observe them. The issues are roughly sorted by impact, in my estimation.
- Implicit number words, such as `(a|an|a pair of|a trio of)` followed by nouns like `(rebound|turnover|assist|block|steal|board|three-pointer|three pointer|free-throw|free throw|dime)`, are ignored. This probably accounts for the largest number of omissions.
- Besides aliases like `{'Los Angeles', 'LA'}`, there are others, like `{'76ers', 'Sixers'}` and `{'Mavericks', 'Mavs'}`.
- The NLTK tokenizer does not separate suffixes containing `’` (which looks similar to `'` but is the Unicode character U+2019), so some entities cannot be identified.
- A player name with initials, such as `J.J.`, may also appear as `JJ`.
- A player name ending with `Jr.` may also end with `Jr`, `, Jr.`, or `, Jr`.
- A player name with a hyphen, in the form `A B-C`, may also appear as `A-B C`.
- Other minor issues, such as `Oklahoma city`.
You can search for these cases in the dataset.
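For anyone patching these locally, here is a minimal sketch of how the fixes above could look. The function names, the regexes, and the (partial) alias table are my own illustrations, not code from the original `data_utils.py`:

```python
import re

# 1. Implicit number words: "a pair of boards" -> 2, "an assist" -> 1, etc.
#    (hypothetical helper; the word lists come from the issue description)
IMPLICIT_NUM = re.compile(
    r"\b(a|an|a pair of|a trio of)\s+"
    r"(rebound|turnover|assist|block|steal|board|"
    r"three-pointer|three pointer|free-throw|free throw|dime)s?\b",
    re.IGNORECASE,
)
IMPLICIT_VALUE = {"a": 1, "an": 1, "a pair of": 2, "a trio of": 3}

def implicit_numbers(sentence):
    """Yield (value, noun) pairs for implicit number words in a sentence."""
    for m in IMPLICIT_NUM.finditer(sentence):
        yield IMPLICIT_VALUE[m.group(1).lower()], m.group(2)

# 2. Extra team aliases beyond {'Los Angeles', 'LA'} (partial table only).
TEAM_ALIASES = {"Sixers": "76ers", "Mavs": "Mavericks"}

# 3. Normalize U+2019 to the ASCII apostrophe before tokenization,
#    so the NLTK tokenizer can split possessive suffixes.
def normalize_apostrophes(text):
    return text.replace("\u2019", "'")

# 4. Generate surface variants of a player name (initials, Jr., hyphens).
def name_variants(name):
    variants = {name}
    variants.add(name.replace(".", ""))  # J.J. -> JJ
    if name.endswith(" Jr."):
        base = name[: -len(" Jr.")].rstrip()
        variants.update({base + " Jr", base + ", Jr.", base + ", Jr"})
    m = re.fullmatch(r"(\S+) (\S+)-(\S+)", name)
    if m:  # A B-C -> A-B C
        variants.add(f"{m.group(1)}-{m.group(2)} {m.group(3)}")
    return variants
```

These helpers could be called on each summary sentence before the existing extraction rules run; the alias table in particular would need to be completed from the full team list.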
I hope these findings help others in the future. I actually discovered these issues while working on (Lin et al., 2020) some years ago; they make the dataset quite noisy.
I have my own version of `data_utils.py` that fixes these issues to some extent.
Thanks for your pioneering work!