gpipe43 is a full-text RSS generator that can be hosted on Google App Engine. Use regular expressions to find and format the full text of an article, or any other content you want.
Inspired by Yahoo Pipes and Feed43.
Yahoo Pipes RIP.
- Supports multi-page articles.
- Displays all images of an article's gallery.
- Can append an article's comments.
- Create a new Cloud Platform project and App Engine application
- Create a bucket in Google Cloud Storage
- Install the Google Cloud SDK for Python
- Add a UA (User-Agent)
prjname
: Name of your project on App Engine.
bucket_name
: Name of the bucket.
subdir4bg
: The crawler runs under: http://[prjname].appspot.com/[subdir4bg]/[rssname]
subdir4rss
: Your RSS feed is served at: http://[prjname].appspot.com/[subdir4rss]/[rssname]
rssname
: Name of the RSS feed.
siteurl
: The website or RSS feed for which you want to generate a full-text RSS feed.
reg4site
: Regex that finds article URLs. Leave blank if siteurl is a feed.
reg4title
: Regex for the title of an article. Leave blank if siteurl is a feed.
reg4pubdate
: Regex for the publication date of an article. Leave blank if siteurl is a feed. The pubdate must contain the format '%Y-%m-%d'; otherwise leave blank.
reg4text
: Regex for the main body of an article.
reg4comment
: Regex for comments. Optional; it can be left blank. You can also use this regex to find all the images of a gallery in the article.
reg4nextpage
: Regex for the article's next page, if the article has more than one page.
Anzahl
: How many articles will be generated. If there is more than one siteurl, this limit applies to EVERY SINGLE siteurl, not to all article URLs from all siteurls combined. 0 = no limit.
*encoding
: Optional. chardet usually detects the right encoding, but sometimes it cannot (for example, it may recognize gb18030 as gb2312). To avoid illegal characters, the 'replace' option of the decode method is used, which leaves replacement characters in the generated feed. To prevent this, you can specify the encoding of the website yourself. It only affects the main text.
rssgen.ausfuehren('use_urllib/use_urlfetch', 'st/mt', siteurl, reg4site, reg4title, reg4pubdate, reg4text, reg4comment, reg4nextpage, Anzahl)
: Generate an RSS feed from a website.
feed_fulltext.ausfuehren('use_urllib/use_urlfetch', siteurl, reg4nextpage, reg4text, reg4comment, Anzahl, rssname)
: Use this to generate full text from an RSS feed.
use_urllib
: Use urllib2, with UA.
use_urlfetch
: Use urlfetch, no UA.
mt
: Multithreading.
st
: Single threading.
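
To get a feel for how the regex parameters and the optional encoding fit together, here is a minimal standalone sketch. It does not call gpipe43 itself. The HTML snippets, the patterns, and the helper names `decode_page` and `first_group` are all made up for illustration, and the convention that the first capture group is the part to keep is an assumption. It is written for Python 3, while the project itself targets the Python 2 App Engine runtime.

```python
import re
from datetime import datetime

# Hypothetical index page and article, held as bytes the way a fetch would return them.
INDEX_HTML = (
    '<ul>'
    '<li><a href="/news/1.html">First</a></li>'
    '<li><a href="/news/2.html">Second</a></li>'
    '</ul>'
).encode("utf-8")

# The character 镕 exists in GB18030 but not in GB2312.
ARTICLE_HTML = (
    '<h1 class="title">朱镕基 interview</h1>'
    '<span class="date">2016-03-01 08:30</span>'
    '<div id="body"><p>Full text here.</p></div>'
).encode("gb18030")

# Illustrative patterns (assumption: the capture group is what gets kept).
reg4site = r'<a href="(/news/\d+\.html)">'
reg4title = r'<h1 class="title">(.*?)</h1>'
reg4pubdate = r'<span class="date">(\d{4}-\d{2}-\d{2})'
reg4text = r'<div id="body">(.*?)</div>'


def decode_page(raw, encoding):
    """Decode with errors='replace': undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, "replace")


def first_group(pattern, html):
    """Return the first capture group of the first match, or '' if nothing matches."""
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else ""


# Step 1: find article URLs on the index page.
article_urls = re.findall(reg4site, decode_page(INDEX_HTML, "utf-8"))

# Step 2: decode the article. A mis-detected encoding leaves replacement characters.
wrong = decode_page(ARTICLE_HTML, "gb2312")
right = decode_page(ARTICLE_HTML, "gb18030")

# Step 3: pull title, date and body out of the correctly decoded page.
title = first_group(reg4title, right)
pubdate = datetime.strptime(first_group(reg4pubdate, right), "%Y-%m-%d")
text = first_group(reg4text, right)

print(article_urls)
print("replacement char with gb2312:", "\ufffd" in wrong)
print(title, pubdate.date(), text)
```

The date pattern captures only the '%Y-%m-%d' part, so a trailing time in the page does not break parsing.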
- Replace 'example' with your own RSS feed's name
- Replace subdir4bg, subdir4rss, and example with your own values.
See official guide: app.yaml Reference, Scheduling Tasks With Cron for Python
- Edit ./main/Vorlage.xml and Vorlage_Error.xml; you can fill in the 'generator', 'webMaster' and 'copyright' elements.
- If you just want to format an existing feed, see example_02.py, then add the url and script to app.yaml. There is no need to add it to feed_list.py and cron.yaml, because the feed will not be saved in Cloud Storage.
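
When the source is already a feed, the article links come from the feed's items rather than from a regex. The sketch below shows one way that could look, including a limit with the same "0 = no limit" meaning as Anzahl. It is not taken from example_02.py: the feed content and the function name `feed_items` are hypothetical, and it uses only the Python 3 standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for this illustration.
FEED_XML = """<rss version="2.0"><channel><title>Example</title>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.com/b</link></item>
<item><title>C</title><link>http://example.com/c</link></item>
</channel></rss>"""


def feed_items(feed_xml, anzahl=0):
    """Return (title, link) pairs from an RSS string; anzahl=0 means no limit."""
    root = ET.fromstring(feed_xml)
    items = [
        (item.findtext("title", ""), item.findtext("link", ""))
        for item in root.iter("item")
    ]
    return items if anzahl == 0 else items[:anzahl]


for item_title, item_link in feed_items(FEED_XML, anzahl=2):
    print(item_title, item_link)
```

Each link returned this way would then be fetched and passed through the text, comment and next-page regexes.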
dev_appserver.py [PATH_TO_YOUR_APP]/app.yaml
Start the crawler: http://localhost:8080/[subdir4bg]/[rssname]
When it finishes, check your RSS feed here: http://localhost:8080/[subdir4rss]/[rssname]
See official guide: Using the Local Development Server
- cd to the directory of your project
gcloud config set project PROJECT_NAME
gcloud app deploy app.yaml cron.yaml --version=VERSION_NUMBER
See official guide: Deploying a Python App