The-Moroccan-News-Corpus

This corpus was built during a summer internship. We used Scrapy spiders/crawlers to crawl Moroccan news websites and saved the scraped data to either JSON or text files. We built spiders/crawlers for the following news websites:

Moroccan News websites

http://ahdath.info/

https://www.akhbarona.com/

https://www.alayam24.com/

https://www.almaghribtoday.net/

https://www.barlamane.com/

https://dalil-rif.com/

https://www.febrayer.com/

https://www.goud.ma/

https://www.hespress.com/

https://ar.hibapress.com/

http://kifache.com/

www.maghress.com

https://www.menara.ma/

https://www.almaghreb24.com/

https://maroctelegraph.com/

https://www.nadorcity.com/

https://tanja24.com/

http://telexpresse.com/

http://ar.le360.ma/

http://www.alyaoum24.com/

http://www.2m.ma/ar/

https://ar.yabiladi.com/

How to use spiders/crawlers?

Each folder is the Scrapy project folder for one newspaper.
To scrape data from any of the newspapers above:

Download its project folder.

On the command line, change directory to the project folder.

Run the following command to start scraping the website: scrapy crawl <name of the spider> -o <name of the file>.json
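The steps above can also be driven from Python. The sketch below is only an illustration: the helper names build_crawl_command and run_crawl are hypothetical, and it assumes that the downloaded project folder contains scrapy.cfg at its top level and that the scrapy command is on your PATH.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_crawl_command(spider_name: str, output_file: str) -> list[str]:
    """Return the argument list for `scrapy crawl <spider> -o <file>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_file]


def run_crawl(project_dir: str, spider_name: str, output_file: str) -> int:
    """Run the crawl from inside the project folder and return the exit code."""
    # Assumption: scrapy.cfg sits directly in the downloaded project folder.
    if not (Path(project_dir) / "scrapy.cfg").is_file():
        raise FileNotFoundError(f"no scrapy.cfg found in {project_dir}")
    # cwd plays the role of "change directory to the project folder".
    result = subprocess.run(
        build_crawl_command(spider_name, output_file),
        cwd=project_dir,
    )
    return result.returncode
```

For example, run_crawl("hespress", "hespress", "hespress.json") would crawl from the hespress folder, assuming the spider is also named hespress. The spider names are not listed here, so run scrapy list inside a project folder to see the actual name.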

Note

Every spider/crawler automatically saves a text file, in addition to the JSON or XML file that you specify with -o when you run the spider from the command line.
About 2 gigabytes of collected texts can be downloaded from this link: https://drive.google.com/open?id=1w2-DTJF2phU3fVf4XkDh1tsN-O3N_baF
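Once a crawl has produced a JSON file, it can be read back with the standard library. The following is a minimal sketch, not the project's own code: the helper names are hypothetical, the field name "text" is an assumption (check the keys in your own output file), and it assumes the file came from a single run, so that it holds one JSON array of items.

```python
import json


def load_items(json_path):
    """Load the list of scraped items from a JSON file written with -o."""
    with open(json_path, encoding="utf-8") as handle:
        items = json.load(handle)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


def items_to_text(items, field="text"):
    """Join one field of every item into plain text, one item per line."""
    lines = []
    for item in items:
        value = item.get(field)
        if value is None:
            # Skip items that do not carry the requested field.
            continue
        # A field may hold a single string or a list of strings.
        if isinstance(value, list):
            value = " ".join(str(part) for part in value)
        lines.append(str(value).strip())
    return "\n".join(lines)
```

Opening the file with encoding="utf-8" matters here because the articles are in Arabic script.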

Repository contents (27 commits):

THE CORPUS
ahdath
akhbarona
alayam24
almaghribtoday
barlamane
dalil_rif
febrayer
good
hespress
hibapress
kifache
maghres
menara
morc24
morctelegraph
nadorcity
tanja
telexpresse
threesixty
today24
twoM
yabiladi
README.md
yab.json

elsayed-issa/The-Moroccan-News-Corpus
