Loading Fannie Mae takes unusually long #300

Open
xiaodaigh opened this issue Aug 31, 2019 · 0 comments

Update

On Julia 1.2 I get the below
I have included an MWE that can be run by downloading the Fannie Mae data directly from https://docs.rapids.ai/datasets/mortgage-data

However, loading the data takes a really long time! I waited for 2 hours and it didn't finish.

On Julia 1.1.1 I get the warnings below:

┌ Warning: In c:/data/perf-jld/Performance_2000Q2.txt line 2063379 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q1.txt line 2162075 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q2.txt line 2063402 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q1.txt line 2162113 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q2.txt line 2063482 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q1.txt line 2162161 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q2.txt line 2063515 has 29 fields but 31 fields are expected. Skipping row.
└ @ TextParse C:\Users\RTX2080\.julia\packages\TextParse\tFXtC\src\csv.jl:372
┌ Warning: In c:/data/perf-jld/Performance_2000Q1.txt line 2162202 has 29 fields but 31 fields are expected. Skipping row.
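
To see how many rows are affected before attempting a full load, a minimal sketch along the following lines could scan a file and list the lines whose field count differs from the expected 31. The helper name `find_bad_rows` is hypothetical (it is not part of JuliaDB or TextParse), and it assumes fields are never quoted, so its count may not match exactly how TextParse counts fields.

```julia
# Hypothetical helper (not part of JuliaDB or TextParse): return the line
# numbers whose delimiter-separated field count differs from `expected`.
# Assumption: fields are never quoted, so counting delimiters is enough.
function find_bad_rows(path::AbstractString; delim::Char = '|', expected::Int = 31)
    bad = Tuple{Int,Int}[]  # (line number, number of fields found)
    for (lineno, line) in enumerate(eachline(path))
        n = count(c -> c == delim, line) + 1
        n == expected || push!(bad, (lineno, n))
    end
    return bad
end

# Small self-check on a temporary file where 3 fields per row are expected;
# the second row is deliberately one field short.
path, io = mktemp()
write(io, "1|a|x\n2|b\n3|c|z\n")
close(io)
bad = find_bad_rows(path; expected = 3)
println(bad)
```

On the real data this would be called on one of the files named in the warnings, keeping the default `expected = 31`.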

Code

using Distributed, Statistics
#addprocs(4)

@time @everywhere using JuliaDB, Dagger

##############################################################
# Download & Extract data
###############################################################

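# `;` shell mode only works at the REPL; `run` with a command literal also works in a script
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz`)
run(`tar xzvf mortgage_2000.tgz`)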


##############################################################
# Specify the types of columns
###############################################################

const fmtypes = [
    Int64,                     String,     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]

datapath = "perf/"
ifiles = joinpath.(datapath, readdir(datapath))


colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
    "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
    "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
    "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
    "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
    "transfer_flg"];

@time jll = loadtable(
    ifiles,
    output = "data/fm.jldb/",
    delim='|',
    header_exists=false,
    filenamecol = "filename",
    #chunks = length(ifiles),
    #type_detect_rows = 20_000,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols=["loan_id", "monthly_rpt_prd"]);