GithubCrawl - works without persistence
- it's working with in-memory representations
  - persistence is disabled
- Sqlite persistence is possible, but the serializers are not working yet
  - see lostisland/sawyer#53
dazza-codes committed Feb 3, 2018
1 parent 67bf7ba commit 1cf8654
Showing 18 changed files with 464 additions and 49 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -12,3 +12,7 @@

# disk db
github_crawl.db

# logs
log/sql.log

12 changes: 8 additions & 4 deletions Gemfile.lock
@@ -2,8 +2,12 @@ PATH
remote: .
specs:
github_crawl (0.1.0)
daybreak
highline
octokit (~> 4.0)
pry
pry-doc
sequel
sqlite3

GEM
remote: https://rubygems.org/
@@ -13,12 +17,12 @@ GEM
coderay (1.1.2)
crack (0.4.3)
safe_yaml (~> 1.0.0)
daybreak (0.3.0)
diff-lcs (1.3)
docile (1.1.5)
faraday (0.14.0)
multipart-post (>= 1.2, < 3)
hashdiff (0.3.7)
highline (1.7.10)
json (2.1.0)
method_source (0.9.0)
multipart-post (2.0.0)
@@ -49,11 +53,13 @@ GEM
sawyer (0.8.1)
addressable (>= 2.3.5, < 2.6)
faraday (~> 0.8, < 1.0)
sequel (5.5.0)
simplecov (0.15.1)
docile (~> 1.1.0)
json (>= 1.8, < 3)
simplecov-html (~> 0.10.0)
simplecov-html (0.10.2)
sqlite3 (1.3.13)
vcr (4.0.0)
webmock (3.3.0)
addressable (>= 2.3.6)
@@ -67,8 +73,6 @@ PLATFORMS
DEPENDENCIES
bundler (~> 1.16)
github_crawl!
pry
pry-doc
rake (~> 10.0)
rspec (~> 3.0)
simplecov
17 changes: 15 additions & 2 deletions README.md
@@ -13,7 +13,20 @@ Then print to stdout a summary of the top 10 repositories by count.

## Usage

$ github_crawl {repo_name}
$ github_crawl

Responses to prompts are optional (pressing RETURN is acceptable). Environment variables can be
set to skip the prompts each time, e.g.

$ export GITHUB_USER={user_login}
$ export GITHUB_PASS={user_password}
$ export GITHUB_REPO="{owner}/{repo}"
$ github_crawl

A GitHub login and password allow authenticated GitHub API requests. Authentication is
optional, but recommended because the GitHub API rate limit is much higher for authenticated requests.
The GitHub account does not need to be an authorized committer on a repository to crawl it.
If no repository is given, the crawl defaults to "kubernetes/kubernetes".
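
A minimal sketch of the fallback order described above (environment variable, then prompt answer, then the default). The `resolve_repo` helper and its parameters are hypothetical and are not part of the gem; it treats a blank answer the same as a missing one.

```ruby
# Hypothetical helper (not part of the gem): choose the repository to crawl.
# env is a Hash-like object such as ENV; answer is the text typed at the prompt.
def resolve_repo(env, answer, default: 'kubernetes/kubernetes')
  candidates = [env['GITHUB_REPO'], answer].map { |v| v.to_s.strip }
  # The first non-blank candidate wins; otherwise fall back to the default.
  candidates.find { |v| !v.empty? } || default
end

puts resolve_repo({}, '')
puts resolve_repo({ 'GITHUB_REPO' => 'octokit/octokit.rb' }, '')
```

Passing the environment in as an argument keeps the helper easy to exercise without touching the real `ENV`.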

## Development

@@ -34,7 +47,7 @@ push git commits and tags, and push the `.gem` file to [rubygems.org](https://ru

- https://developer.github.com/v3/libraries/
- https://developer.github.com/v3/#rate-limiting
- 5000 requests per hour
- 5000 requests per hour, if authenticated

## Contributing

6 changes: 3 additions & 3 deletions bin/console
@@ -1,10 +1,10 @@
#!/usr/bin/env ruby

require "bundler/setup"
require "github_crawl"
require 'bundler/setup'
require 'github_crawl'

# You can add fixtures and/or initialization code here to make experimenting
# with your gem easier. You can also use a different console, if you like.

require "pry"
require 'pry'
Pry.start
75 changes: 75 additions & 0 deletions exe/github_crawl
@@ -0,0 +1,75 @@
#!/usr/bin/env ruby

require 'bundler/setup'
require 'github_crawl'
require 'highline'
require 'set'

# ---
# Command Line Prompts and Configuration

cli = HighLine.new

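repo_name = ENV['GITHUB_REPO'] || cli.ask('github repo in the form "{owner}/{repo}": ')
# cli.ask returns an empty string (not nil) when the prompt is answered with RETURN
repo_name = 'kubernetes/kubernetes' if repo_name.to_s.strip.empty?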

github_user = ENV['GITHUB_USER'] || cli.ask('github user: ')
github_pass = ENV['GITHUB_PASS'] || cli.ask('github pass: ') { |q| q.echo = '*' }
# Skipped prompts yield empty strings, so test for blank values rather than nil
unless github_user.to_s.empty? || github_pass.to_s.empty?
  Octokit.configure do |c|
    c.login = github_user
    c.password = github_pass
  end
  # # TODO: try to use an auth-token
  # auth = Octokit.create_authorization(:scopes => ["user"], :note => "GithubCrawl")
  # Octokit.bearer_token = auth[:token]
end

Octokit.auto_paginate = true


# ---
# Github Crawling by repo
#
# TODO: try to use https://developer.github.com/v3/#conditional-requests

# Accumulate repository information in this repos hash; note that the
# keys are repository "name" strings and not "full_name" strings.
repos = {}

begin
  repo = GithubCrawl::Repo.new(full_name: repo_name)
  contributors = repo.contributors
  repos[repo.name] = contributors.map(&:login).to_set

  contributors.each do |user|
    GithubCrawl.check_rate_limit
    user.repos.each do |user_repo|
      # Q: a user could fork a repository without ever contributing to it; so
      # does this need to check whether a user is also a contributor to a repository?
      repos[user_repo.name] ||= Set.new
      repos[user_repo.name] << user.login
    end
  end
rescue StandardError => err
  puts err.message
end

# ---
# Report the most popular repositories among the contributors

# sort the repos by the number of users who list them among their repositories
repos_sorted = repos.sort { |r1, r2| r2[1].size <=> r1[1].size }

# report the repo contributor count and its name for the top 10 repos
repos_sorted.first(10).each { |repo| puts "#{repo[1].size}: #{repo[0]}" }

# It's interesting to pause here to inspect all the data. For example:
# repos.length
# repos.values.map(&:length).uniq
# repos_sorted.first(10).each { |repo| puts "#{repo[1].size}: #{repo[0]}\n\t#{repo[1].sort}" }

# Ctrl-D or exit! to quit
require 'pry'
binding.pry
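
The ranking step in this script can be tried in isolation. The sketch below uses made-up sample data and a hypothetical `top_repos` helper (neither is part of the gem); it only assumes, as the script does, that each repo name maps to a `Set` of contributor logins.

```ruby
require 'set'

# Hypothetical sample data: repo name => Set of contributor logins.
repos = {
  'kubernetes' => Set['alice', 'bob', 'carol'],
  'dotfiles'   => Set['alice', 'bob'],
  'website'    => Set['carol']
}

# Illustrative helper (not part of the gem): rank repos by the number of
# distinct logins, largest first, and keep at most `limit` entries.
def top_repos(repos, limit = 10)
  repos.sort_by { |_name, logins| -logins.size }.first(limit)
end

top_repos(repos, 2).each do |name, logins|
  puts "#{logins.size}: #{name}"
end
```

Using `first(limit)` rather than `slice(0, limit - 1)` avoids an off-by-one in the number of rows reported.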

10 changes: 7 additions & 3 deletions github_crawl.gemspec
@@ -24,14 +24,18 @@ Gem::Specification.new do |spec|
spec.files = `git ls-files -z`.split("\x0").reject do |f|
f.match(%r{^(test|spec|features)/})
end
spec.bindir = "exe"
spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
spec.require_paths = ["lib"]

spec.add_dependency "daybreak"
spec.add_dependency "highline"
spec.add_dependency "octokit", "~> 4.0"
spec.add_dependency "sequel"
spec.add_dependency "sqlite3"
spec.add_dependency "pry"
spec.add_dependency "pry-doc"

spec.add_development_dependency "bundler", "~> 1.16"
spec.add_development_dependency "pry"
spec.add_development_dependency "pry-doc"
spec.add_development_dependency "rake", "~> 10.0"
spec.add_development_dependency "rspec", "~> 3.0"
spec.add_development_dependency "simplecov"
20 changes: 17 additions & 3 deletions lib/boot.rb
@@ -1,7 +1,21 @@
require 'daybreak'
require 'octokit'

require 'github_crawl/version'
# Github data
require 'octokit'
require 'github_crawl/repo'
require 'github_crawl/user'

# Local persistence
require 'sequel'
require 'sqlite3'
require 'github_crawl/sawyer_serializer'
require 'github_crawl/sql_db'
require 'github_crawl/sql_base'
require 'github_crawl/sql_repos'
require 'github_crawl/sql_users'

# Serializers
require 'json'
require 'yaml'

require 'github_crawl/version'

15 changes: 14 additions & 1 deletion lib/github_crawl.rb
@@ -1,5 +1,18 @@
require 'boot'

module GithubCrawl
DB = Daybreak::DB.new 'github_crawl.db'
DB = SqlDb.new

# Check the rate limit
# @return [void]
def self.check_rate_limit
response = Octokit.last_response
return if response.nil?
rate_limit = response.headers['x-ratelimit-limit'].to_i # hits per hour
rate_remaining = response.headers['x-ratelimit-remaining'].to_i
rate_reset = response.headers['x-ratelimit-reset'].to_i
return if rate_remaining > 100
puts "WARNING: rate limit (#{rate_limit}) remainder: #{rate_remaining}"
puts "WARNING: rate limit (#{rate_limit}) resets at #{Time.at(rate_reset)}"
end
end
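
As a rough illustration of the header arithmetic in `check_rate_limit`, the sketch below works on a plain hash instead of `Octokit.last_response`. The `rate_limit_status` helper, its keyword arguments, and the sample header values are all hypothetical; only the three `x-ratelimit-*` header names come from the code above.

```ruby
# Hypothetical helper (not part of the gem): summarise rate-limit headers.
# headers is any Hash-like object keyed by lower-case header names.
def rate_limit_status(headers, now: Time.now, threshold: 100)
  remaining = headers['x-ratelimit-remaining'].to_i
  reset_at  = Time.at(headers['x-ratelimit-reset'].to_i) # epoch seconds
  {
    remaining: remaining,
    low: remaining <= threshold, # mirrors the "> 100" early return above
    seconds_until_reset: [(reset_at - now).ceil, 0].max
  }
end

# Made-up header values for illustration only.
headers = {
  'x-ratelimit-limit'     => '5000',
  'x-ratelimit-remaining' => '42',
  'x-ratelimit-reset'     => '1517700000'
}
status = rate_limit_status(headers, now: Time.at(1_517_699_400))
puts "remaining=#{status[:remaining]} low=#{status[:low]} wait=#{status[:seconds_until_reset]}s"
```

Returning the seconds until reset would let a caller sleep until the window reopens instead of only printing a warning; that behaviour is a suggestion, not something this commit implements.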
56 changes: 41 additions & 15 deletions lib/github_crawl/repo.rb
@@ -3,18 +3,31 @@ module GithubCrawl
# A Github repository
class Repo

# @param [String] repo_name in the form "{owner}/{repo}"
def initialize(repo_name)
# @repo = db_read(repo_name)
@repo ||= fetch_repo(repo_name)
raise "Cannot locate repo: #{repo_name}" unless full_name == repo_name
attr_reader :repo

# @param [String] full_name in the form "{owner}/{repo}"
# @param [Sawyer::Resource] repo data from github
def initialize(full_name: nil, repo: nil)
if repo.is_a?(Sawyer::Resource)
@repo = repo
# db_save(@repo)
elsif full_name.is_a?(String)
@repo = db_read(full_name)
@repo ||= fetch_repo(full_name)
end
raise 'Cannot locate repo' if @repo.nil?
rescue StandardError => err
log_error(err)
end

# @return [Array<String>] user logins
# @return [Array<GithubCrawl::User>] github users
def contributors
repo.rels[:contributors].get.data.map(&:login)
@contributors ||= begin
data = repo.rels[:contributors].get.data
# TODO: get additional paginated data?
# data.concat Octokit.last_response.rels[:next].get.data
data.map { |user| GithubCrawl::User.new(user: user) }
end
rescue StandardError => err
log_error(err)
end
@@ -23,33 +36,46 @@ def full_name
repo[:full_name]
end

private
def name
repo[:name]
end

attr_reader :repo
private

# @param [String] repo_name in the form "{owner}/{repo}"
# @return [Sawyer::Resource, nil] repo resource
def fetch_repo(repo_name)
repo = Octokit.repo(repo_name)
db_save(repo) unless repo.nil?
repo
end

# @param [String] repo_name in the form "{owner}/{repo}"
# @return [Sawyer::Resource] repo
def db_read(repo_name)
GithubCrawl::DB[repo_name]
# @param [String] full_name for repo
# @return [Sawyer::Resource, nil] repo
def db_read(full_name)
# TODO: disabled until the deserialization works correctly
return nil
model.read(full_name)
end

# @param [Sawyer::Resource] repo
# @return [void]
def db_save(repo)
GithubCrawl::DB.lock do
GithubCrawl::DB.set! repo[:full_name], repo.to_h
# TODO: disabled until the deserialization works correctly
return nil
if model.exists?(repo[:full_name])
model.update(repo)
else
model.create(repo)
end
rescue StandardError => err
log_error(err)
end

def model
@model ||= GithubCrawl::SqlRepos.new
end

def log_error(err)
STDERR.write "#{err.message}\n"
STDERR.flush
25 changes: 25 additions & 0 deletions lib/github_crawl/sawyer_serializer.rb
@@ -0,0 +1,25 @@
module GithubCrawl

# Serialize and deserialize Sawyer Resource data
module SawyerSerializer

# @param [Sawyer::Resource] sawyer_resource
# @return [String] serialized data
def serialize(sawyer_resource)
# attrs = sawyer_resource.attrs.to_h.to_json
# rels = sawyer_resource.rels.to_h.to_json
# fields = sawyer_resource.fields.to_a.to_json
Marshal.dump(sawyer_resource.marshal_dump)
end

# @param [String] serialized data
# @return [Sawyer::Resource] sawyer_resource
def deserialize(dumped)
agent = Sawyer::Agent.new('https://api.github.com/',
links_parser: Sawyer::LinkParsers::Simple.new)
resource = Sawyer::Resource.new(agent)
resource.marshal_load(Marshal.restore(dumped))
resource
end
end
end
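
The commit message notes that these serializers are not working yet. One possible alternative, which is not what this commit does, is to persist only the plain attribute hash as JSON rather than marshalling the whole resource. The sketch below uses a made-up hash in place of a real `Sawyer::Resource`, so it runs without network access or the sawyer gem.

```ruby
require 'json'

# Hypothetical stand-in for the attribute hash of a repo resource.
repo_attrs = { full_name: 'kubernetes/kubernetes', name: 'kubernetes', fork: false }

# Serialize to a JSON string, e.g. for storage in a TEXT column.
dumped = JSON.generate(repo_attrs)

# Deserialize back into a hash with symbol keys.
restored = JSON.parse(dumped, symbolize_names: true)

puts restored[:full_name]
```

The trade-off is that a plain hash does not carry the resource's relation links, so calls such as `repo.rels[:contributors]` would need to be rebuilt some other way after loading.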