Find, Upload and Cleanse Persian Wiki Dump #4

Open
sehsanm opened this issue Dec 3, 2018 · 6 comments

@sehsanm
Owner

sehsanm commented Dec 3, 2018

  • Find the Persian Wiki Dump.
  • Cleanse it (remove the XML/HTML tags).
  • Define a corpus file standard (to be discussed with the other corpus builders), most probably one sentence per line. Talk to the owner of "Find and upload Persian News Corpus" #1.
  • Upload the zipped version of the corpus to the S3 bucket (contact @sehsanm for the access details).
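
The cleansing and one-sentence-per-line steps above could be sketched roughly as follows. This is only an illustration, not the tooling actually used in this issue: the function names `strip_markup` and `to_sentence_lines` are hypothetical, the tag removal is a plain regular expression that does not handle wikitext markup such as templates or link brackets, and the sentence splitter naively breaks after `.`, `!`, `?` or the Persian question mark `؟`.

```python
import html
import re

# Anything that looks like an XML/HTML tag, e.g. <doc id="1"> or </p>.
TAG_RE = re.compile(r"<[^>]+>")
# Whitespace that follows sentence-final punctuation (assumed set, incl. Persian '؟').
SENT_END_RE = re.compile(r"(?<=[.!?؟])\s+")


def strip_markup(text: str) -> str:
    """Remove tags, decode entities such as &amp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def to_sentence_lines(text: str) -> list[str]:
    """Return the cleaned text as a list with one sentence per entry."""
    clean = strip_markup(text)
    return [sentence for sentence in SENT_END_RE.split(clean) if sentence]


if __name__ == "__main__":
    sample = "<doc><p>این یک جمله است. این جملهٔ دوم است؟</p></doc>"
    # Write each sentence on its own line, as the proposed corpus format suggests.
    print("\n".join(to_sentence_lines(sample)))
```

A production pipeline would more likely rely on a dedicated Wikipedia dump extractor and a Persian-aware sentence tokenizer; the sketch only shows the shape of the output format under discussion.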
sehsanm added the CORPUS label Dec 3, 2018
sehsanm added this to the Assignment milestone Dec 3, 2018
@FullDataAlchemist
Collaborator

How can I pick this one up?

@sehsanm
Owner Author

sehsanm commented Dec 4, 2018 via email

FullDataAlchemist self-assigned this Dec 4, 2018
@FullDataAlchemist
Collaborator

Thanks.

@FullDataAlchemist
Collaborator

Hi. I have uploaded the cleaned text of the wiki dump; the sentences are also segmented.
The raw text was also uploaded as a separate file by mistake. I think it is unusable, so you can delete that file.

Thanks.

@sehsanm
Owner Author

sehsanm commented Dec 16, 2018 via email

@FullDataAlchemist
Collaborator

FullDataAlchemist commented Dec 16, 2018

Of course. I sent a pull request.
