Several issues I find in the entity/number extraction functions #19

@wwt17

Here are several issues I found in the rule-based entity/number extraction functions in data_utils.py. I'm still working on the dataset and will probably add more as I observe them. The issues are roughly ordered by their impact, in my impression.

  1. Implicit number words, such as (a|an|a pair of|a trio of) followed by nouns like (rebound|turnover|assist|block|steal|board|three-pointer|three pointer|free-throw|free throw|dime), are ignored. This probably results in the largest number of omissions.
  2. Besides aliases like {'Los Angeles', 'LA'}, there are other aliases, like {'76ers', 'Sixers'} and {'Mavericks', 'Mavs'}.
  3. The NLTK tokenizer does not separate suffixes containing ’ (looks similar to ', but is Unicode U+2019), so some entities cannot be identified.
  4. A player name with initials, such as J.J., may appear in another form, like JJ.
  5. A player name ending with "Jr." may also end with "Jr", ", Jr." or ", Jr".
  6. A player name with a hyphen, in the form A B-C, may appear in another form, A-B C.
  7. Other minor issues, such as Oklahoma city.
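
To make item 1 concrete, here is a minimal sketch of how implicit number words could be picked up with a regular expression. It is not taken from data_utils.py; the names IMPLICIT_COUNTS, STAT_NOUNS and extract_implicit_numbers are hypothetical, and the mapping of "a"/"an" to 1, "a pair of" to 2 and "a trio of" to 3 is the assumed reading of those phrases. It works on raw text rather than on NLTK tokens.

```python
import re

# Assumed mapping from implicit number phrases to counts (hypothetical names).
IMPLICIT_COUNTS = {"a": 1, "an": 1, "a pair of": 2, "a trio of": 3}

# Stat nouns listed in item 1.
STAT_NOUNS = [
    "rebound", "turnover", "assist", "block", "steal", "board",
    "three-pointer", "three pointer", "free-throw", "free throw", "dime",
]

# Longer phrases come first so "a pair of" is tried before plain "a".
_PATTERN = re.compile(
    r"\b(a pair of|a trio of|an|a)\s+("
    + "|".join(re.escape(noun) for noun in STAT_NOUNS)
    + r")s?\b",
    re.IGNORECASE,
)


def extract_implicit_numbers(text):
    """Return (count, noun) pairs for implicit number phrases found in text."""
    return [
        (IMPLICIT_COUNTS[m.group(1).lower()], m.group(2).lower())
        for m in _PATTERN.finditer(text)
    ]
```

For example, extract_implicit_numbers("He added a pair of steals and an assist.") yields the pairs (2, "steal") and (1, "assist"). The trailing word boundary keeps phrases such as "a blocked shot" from matching "block".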

You can search for these cases in the dataset.
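
When searching, it may help to collapse the surface variants from items 2 to 6 into one key first. The following is only a sketch under stated assumptions: canonical_key and TEAM_ALIASES are hypothetical names, the alias table contains just the pairs mentioned above, aliases are matched against the whole string only, and lowercasing is assumed to be enough for cases like "Oklahoma city". The player names in the usage note are made up.

```python
import re

# Hypothetical alias table, limited to the pairs mentioned in this issue.
TEAM_ALIASES = {"Sixers": "76ers", "Mavs": "Mavericks", "LA": "Los Angeles"}


def canonical_key(name):
    """Map variant spellings of an entity name to one comparison key."""
    # Item 3: treat the Unicode apostrophe U+2019 like a plain ASCII one.
    name = name.replace("\u2019", "'").strip()
    # Item 2: resolve known aliases (whole-string match only).
    name = TEAM_ALIASES.get(name, name)
    # Item 5: "Jr", "Jr.", ", Jr" and ", Jr." all become " Jr".
    name = re.sub(r",?\s+Jr\.?$", " Jr", name)
    # Item 4: drop periods so "J.J." and "JJ" agree.
    name = name.replace(".", "")
    # Item 6: hyphens and spaces are treated as the same separator.
    name = re.sub(r"[-\s]+", " ", name)
    # Assumed fix for capitalization variants.
    return name.lower()
```

With this, canonical_key("J.J. Smith") and canonical_key("JJ Smith") give the same key, as do "Ann Bee-Cee" and "Ann-Bee Cee". Calling the function on text before tokenization would also sidestep the U+2019 problem, although a key this lossy is meant for matching, not for display.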

Hopefully my findings will help people in the future. I actually found these issues while working on Lin et al., 2020 some years ago. They make the dataset quite noisy.

I do have my own version of data_utils.py which fixes these issues to some extent.

Thanks for your pioneering work!
