
verisr2

Convenience functions for exploratory analysis on VERIS database (http://veriscommunity.net).

Small helper functions for working with the data frame objects from the VERIS Community Database (VCDB), typically converted from JSON using the verisr package (or, if that is unavailable, my fork of it). This package replicates, in base R or dplyr, many of the helper functions originally implemented in the verisr package by Jay Jacobs.

The original package by Jay uses data.table code that is deprecated and no longer works with recent versions of R. The author has stated his desire to rewrite these functions in dplyr one day, but that effort has been stagnant for a few years, so this package is a simple attempt to recreate the helpers in dplyr or base R.

Installation and Getting Started

Install it from GitHub and load the built-in dataset:

# install devtools from https://github.com/hadley/devtools
devtools::install_github("onlyphantom/verisr2")
library(verisr2)
data(vcdb)

Inspecting the class of the data:

class(vcdb)
## [1] "verisr"     "data.frame"

Because the incidents are originally recorded in JSON, the transformed data is “wide”, spanning more than 2,430 variables as of this writing. The VERIS specification intends for the data schema to be extensible, and when expressed as a data frame, this wide format lends itself to data analysis and exploratory exercises:

dim(vcdb)
## [1] 8198 2436

Convenience Functions

Retrieve a list of variables (enumeration / factors) in the data frame from a specified “parent”:

getenum_stri(vcdb, "action.error.vector")[1:5]
## [1] "action.error.vector.Carelessness"         
## [2] "action.error.vector.Inadequate personnel" 
## [3] "action.error.vector.Inadequate processes" 
## [4] "action.error.vector.Inadequate technology"
## [5] "action.error.vector.Other"
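As a rough illustration of the idea (not the package's actual implementation), looking up the columns under a parent amounts to matching a name prefix. The sketch below uses a toy data frame with made-up columns, and `columns_under()` is a hypothetical helper name:

```r
# Toy stand-in for a VERIS-style wide data frame (columns are illustrative only).
toy <- data.frame(
  action.error.vector.Carelessness = c(TRUE, FALSE),
  action.error.vector.Other        = c(FALSE, TRUE),
  action.hacking.vector.VPN        = c(FALSE, FALSE),
  check.names = FALSE
)

# Hypothetical helper: return the column names that sit under one or more parents.
columns_under <- function(df, parents) {
  nm <- names(df)
  unlist(lapply(parents, function(p) nm[startsWith(nm, paste0(p, "."))]))
}

columns_under(toy, "action.error.vector")
```

Appending the trailing dot to the parent before matching keeps a parent such as `action.error` from accidentally matching a sibling such as `action.errors`.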

The same function can also be called with a character vector of parents instead of a single string:

getenum_stri(vcdb, c("actor.internal.motive", "value_chain.money laundering.variety"))[8:12]
## [1] "actor.internal.motive.NA"                 
## [2] "actor.internal.motive.Other"              
## [3] "actor.internal.motive.Secondary"          
## [4] "actor.internal.motive.Unknown"            
## [5] "value_chain.money laundering.variety.Bank"

To get a frequency table, use getenum_tbl:

getenum_tbl(vcdb, c("action", "asset.variety"))
##           action.Malware           action.Hacking            action.Social 
##                      678                     2185                      554 
##          action.Physical            action.Misuse             action.Error 
##                     1565                     1675                     2374 
##     action.Environmental           action.Unknown     asset.variety.Server 
##                        8                      237                     3819 
##    asset.variety.Network   asset.variety.User Dev      asset.variety.Media 
##                      157                     1478                     2207 
##     asset.variety.Person asset.variety.Kiosk/Term    asset.variety.Unknown 
##                      606                      345                      646 
##   asset.variety.Embedded 
##                        2
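Conceptually, because each enumeration value is stored as its own logical column, a frequency table like the one above is a column sum. A minimal sketch on toy data, assuming logical indicator columns (this is not the package's code):

```r
# Toy indicator columns: one row per incident, TRUE where the value applies.
toy <- data.frame(
  action.Malware = c(TRUE, FALSE, TRUE),
  action.Hacking = c(TRUE, TRUE, FALSE),
  action.Error   = c(FALSE, FALSE, FALSE)
)

# Summing each logical column counts the incidents flagged with that value.
counts <- colSums(toy)
counts
```

An incident can carry several values of the same enumeration, so the counts need not add up to the number of rows.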

We can use the getenum_df function to get both the count and the proportion of assets where data loss has occurred. This replicates the original functionality from the jayjacobs and vz-risk versions, but uses base R underneath:

getenum_df(vcdb, "asset.variety")
##         enum    x    n    freq
## 1     Server 3819 8188 0.46641
## 2      Media 2207 8188 0.26954
## 3   User Dev 1478 8188 0.18051
## 4     Person  606 8188 0.07401
## 5 Kiosk/Term  345 8188 0.04213
## 6    Network  157 8188 0.01917
## 7   Embedded    2 8188 0.00024
## 8    Unknown  646   NA      NA
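The count-and-proportion layout can be approximated in a few lines of base R. The sketch below is an assumption about the general approach, on toy data: it takes `n` to be the number of incidents with at least one value set, and it does not try to reproduce how the package treats `Unknown`:

```r
# Toy indicator columns for a single enumeration (illustrative only).
toy <- data.frame(
  asset.variety.Server = c(TRUE, TRUE, FALSE, FALSE),
  asset.variety.Media  = c(FALSE, TRUE, TRUE, FALSE),
  asset.variety.Person = c(FALSE, FALSE, FALSE, FALSE),
  check.names = FALSE
)

x <- colSums(toy)              # incidents per enumeration value
n <- sum(rowSums(toy) > 0)     # incidents with at least one value set (assumed denominator)

res <- data.frame(
  enum = sub("^asset\\.variety\\.", "", names(x)),
  x    = as.integer(x),
  n    = n,
  freq = unname(round(x / n, 5)),
  stringsAsFactors = FALSE
)
res <- res[order(-res$x), ]    # most frequent first
res
```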

Similarly, we can pass a character vector naming two enumerations to the function, which will count the number of incidents across each combination of the two:

getenum_df(vcdb, c("action", "asset.variety"))
## # A tibble: 64 x 3
##    action   asset.variety     x
##    <chr>    <chr>         <int>
##  1 Hacking  Server         1890
##  2 Error    Media          1395
##  3 Misuse   Server         1030
##  4 Physical User Dev        706
##  5 Error    Server          662
##  6 Social   Person          554
##  7 Physical Media           478
##  8 Malware  Server          453
##  9 Social   Server          375
## 10 Malware  User Dev        371
## # … with 54 more rows
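One way to think about the two-enumeration count is as a co-occurrence matrix: with 0/1 indicator matrices `A` and `B`, the product `t(A) %*% B` counts the incidents in which each pair of values appears together. This is a sketch on toy data, not a description of how getenum_df is implemented:

```r
# Toy indicator matrices: rows are incidents, columns are enumeration values.
action <- cbind(Hacking = c(TRUE, FALSE, TRUE), Error = c(FALSE, TRUE, FALSE))
asset  <- cbind(Server  = c(TRUE, FALSE, TRUE), Media = c(FALSE, TRUE, TRUE))

# crossprod(A, B) computes t(A) %*% B; multiplying by 1 converts logicals to 0/1.
co <- crossprod(action * 1, asset * 1)

# Reshape to one row per (action, asset) pair, similar to the tibble shown above.
long <- as.data.frame(as.table(co))
names(long) <- c("action", "asset.variety", "x")
long[order(-long$x), ]
```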

enum2grid replicates the plotting function in Jay Jacobs's version, and works with all recent versions of R:

enum2grid(vcdb, c("asset.variety", "actor.external.variety"))

Another example:

enum2grid(vcdb, c("action", "asset.variety"))

importveris() is a thin wrapper over the json2veris() function. With later versions of the VCDB incidents, the original function may return a data frame in which one or more variables are themselves nested list objects. importveris() removes these columns, leaving the data frame in a state that is ready for most data analysis tasks:

vcdb_small <- importveris("~/Datasets/vcdb_small/")
## [1] "veris dimensions"
## [1]    0 2437
## named integer(0)
## named integer(0)
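Removing nested list columns from a data frame is straightforward in base R. The following is a minimal sketch of that step on toy data; it is an assumption about the general technique, not the body of importveris():

```r
# Toy data frame with one nested list column, as can happen after JSON conversion.
df <- data.frame(id = 1:2, year = c(2017, 2018))
df$nested <- list(list(a = 1), list(b = 2))

# Flag list-typed columns and keep everything else.
is_list_col <- vapply(df, is.list, logical(1))
flat <- df[, !is_list_col, drop = FALSE]

names(flat)
```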

Transform VCDB to a tidyverse-esque data frame

collapse_vcdb() takes a vcdb data frame and turns it into a more compact data frame that conforms to the “tidyverse” specifications. New features are created from the original data, using values that best represent each enumeration. An oversimplified diagram explaining this process is as follows:

tidy_vcdb <- collapse_vcdb(vcdb)
str(tidy_vcdb)
## Loading verisr2

## 'data.frame':    8198 obs. of  15 variables:
##  $ action                      : Factor w/ 9 levels "Environmental",..: 5 7 2 3 2 2 3 3 7 6 ...
##  $ action.environmental.notes  : chr  NA NA NA NA ...
##  $ action.environmental.variety: Factor w/ 4 levels "Fire","Humidity",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ action.error.notes          : chr  NA NA NA NA ...
##  $ action.error.variety        : Factor w/ 18 levels "Capacity shortage",..: 18 18 6 18 10 10 18 18 18 18 ...
##  $ action.error.vector         : Factor w/ 8 levels "Carelessness",..: 8 8 8 8 1 1 8 8 8 8 ...
##  $ action.hacking.cve          : chr  NA NA NA NA ...
##  $ action.hacking.notes        : chr  NA NA NA NA ...
##  $ action.hacking.result       : Factor w/ 5 levels "Elevate","Exfiltrate",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ action.hacking.variety      : Factor w/ 8 levels "Brute force",..: 6 6 6 5 6 6 6 6 6 6 ...
##  $ action.hacking.vector       : Factor w/ 11 levels "Backdoor or C2",..: 9 9 9 11 9 9 11 11 9 9 ...
##  $ action.malware.cve          : chr  NA NA NA NA ...
##  $ action.malware.name         : chr  NA NA NA NA ...
##  $ action.malware.notes        : chr  NA NA NA NA ...
##  $ action.malware.result       : Factor w/ 5 levels "Elevate","Exfiltrate",..: 5 5 5 5 5 5 5 5 5 5 ...

Note that the new data frame is a lot more compact, with 175 instead of the original 2,430+ variables:

dim(tidy_vcdb)
## [1] 8198  175
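To make the collapsing step concrete, here is a small sketch that folds a group of logical indicator columns into a single factor. It picks the first TRUE column in each row, which is an assumption made for illustration; the rule collapse_vcdb() uses to choose the representative value may differ:

```r
# Toy indicator columns for one enumeration; the last incident has no value set.
toy <- data.frame(
  action.Hacking = c(TRUE, FALSE, FALSE, FALSE),
  action.Error   = c(FALSE, TRUE, FALSE, FALSE),
  action.Unknown = c(FALSE, FALSE, TRUE, FALSE)
)

m <- as.matrix(toy) * 1                        # 0/1 numeric matrix
labels <- sub("^action\\.", "", colnames(m))   # strip the parent prefix

# Index of the first TRUE column per row; rows with no TRUE become NA.
first_true <- max.col(m, ties.method = "first")
first_true[rowSums(m) == 0] <- NA

collapsed <- factor(labels[first_true], levels = labels)
collapsed
```

Collapsing this way keeps one value per incident, which is where the loss of fidelity comes from when an incident has several values set.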

Where the original VCDB has a shape that resembles a “sparse matrix”, the new “tidy” data frame stores most variables as factor or numeric values. Some loss of fidelity is unavoidable: a 2,436-column matrix in which most values are 0 is reduced to 175 columns, where only the representative value is stored for each dimension / enumeration. The column classes break down as follows:

## 
## c("ordered", "factor")              character                 factor 
##                      1                     59                    105 
##                numeric 
##                     10
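A tabulation of column classes like this can be produced with a short base R expression. The sketch below runs on a toy data frame and is one possible way to do it, not necessarily the call that generated the output above:

```r
# Toy data frame with a mix of column classes.
toy <- data.frame(
  a = factor(c("x", "y")),
  b = c("p", "q"),
  c = c(1.5, 2),
  stringsAsFactors = FALSE
)
toy$d <- factor(c("lo", "hi"), levels = c("lo", "hi"), ordered = TRUE)

# Join multi-class vectors (e.g. "ordered" "factor") into one label per column.
cls <- vapply(toy, function(col) paste(class(col), collapse = " "), character(1))
class_counts <- table(cls)
class_counts
```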

Combining with ggplot2

The data (both the original vcdb and its tidy variant) also works well with the rest of the tidyverse. For example, it can be used in conjunction with dplyr and ggplot2:

library(dplyr)
library(ggplot2)

# Count incidents per year, split by whether data disclosure was confirmed
vcdb %>%
  group_by(attribute.confidentiality.data_disclosure.Yes) %>%
  dplyr::count(timeline.incident.year) %>%
  ungroup() %>% 
  mutate(
    breach = ifelse(attribute.confidentiality.data_disclosure.Yes, 
                    "Breach", "Incident")
  ) %>% filter(
    timeline.incident.year > 2000
  ) %>% ggplot(aes(x=timeline.incident.year, y=n, group=breach)) +
  geom_col(aes(fill=breach), position = "dodge") +
  scale_x_continuous(expand=c(0,0), breaks=seq(2000, 2018, 3)) + 
  scale_y_continuous(expand=c(0,0)) + 
  scale_fill_brewer(palette = 11) + 
  labs(title="VCDB Confidentiality Breaches", caption="Confidentiality breaches where data disclosure occurred")

Country-level investigation

Existing functions in this package already make country-level inspection fairly effortless:

summary(tidy_vcdb$actor.external.country, maxsum=8)
##  Unknown       US Multiple       RU       CN       PK       SY  (Other) 
##     7320      220      180      110       43       40       36      249

We can use the collapsed dataframe (result of collapse_vcdb) to perform our inspection:

usvictim <- subset(tidy_vcdb, victim.country=="US")
head(usvictim$notes)
## [1] "lincoln financial securities Corporation is a subsidiary of Lincoln national Corporation"                                                                
## [2] "Limited information provided and there have been no follow-up articles."                                                                                 
## [3] "HHS Breach Tool"                                                                                                                                         
## [4] "The Sentry email mistake was modeled seperately. "                                                                                                       
## [5] "I can't discern who was breached here. It says DoD. But it also says satellite manufacturer. I'm assuming the latter working for DoD"                    
## [6] "The final record count was obtained from the HHS Breachtool record for this incident.  It was listed under the partner rather than Owensboro, strangely."

As of version 0.4.0, the new function involving_country() allows us to query even more effectively for all incidents where a specified country is involved:

us <- involving_country(data = vcdb, "US")
head(us$discovery_notes)
## [1] "In June 2014, Epic Systems Corp. in Verona received an email that no software company can ignore: Employees of a company working for one of its customers had gained unauthorized access to a restricted website and may have stolen documents that contained trade secrets."
## [2] "actor was arrested on drug charge and they found skimmer and cards, notified employer."                                                                                                                                                                                      
## [3] "Committed ID fraud against her own maid of honor and used her real phone number when establishing fraudulent lines of credit."                                                                                                                                               
## [4] "the FBI was investigating after 2.5GB of data taken from its servers was dumped online and swiftly shared on social media. The union's national site, fop.net, remained offline on Thursday evening"                                                                         
## [5] "Engel said, though, that the university didn’t confirm that data had been breached or learn about its apparent scope until external investigators notified officials July 31, 2018."                                                                                         
## [6] "We have disabled the malware and have reconfigured our point-of-sale and payment card processing systems to enhance the security of these systems"

By default, involving_country returns all columns of every incident involving that country. To retrieve only the columns where one or more notes are present (discovery notes, incident notes, impact notes, actor notes, etc.; there are more than 30 such columns), set notes_only to TRUE. The function also drops any incidents (rows) where no notes are present:

# only notes-type columns
us_small <- involving_country(vcdb, "US", notes_only=TRUE)
dim(us_small)
## [1] 1928   30
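The filtering described above can be sketched in base R on a toy data frame. The column names and the matching rule (any column whose name contains "country" equals the requested code) are assumptions made for illustration, not the actual logic of involving_country():

```r
# Toy incidents with two country columns and two notes columns (illustrative only).
toy <- data.frame(
  victim.country         = c("US", "GB", "CA"),
  actor.external.country = c("RU", "US", "CN"),
  notes                  = c("laptop stolen", NA, "phishing"),
  discovery_notes        = c(NA, NA, "found by audit"),
  stringsAsFactors = FALSE
)

# Keep incidents where any country column matches the requested code.
country_cols <- grep("country", names(toy), value = TRUE)
involved <- Reduce(`|`, lapply(toy[country_cols], function(col) col %in% "US"))
hits <- toy[involved, ]

# Keep only notes columns, then drop columns and rows that hold no notes at all.
notes <- hits[, grep("notes", names(hits), value = TRUE), drop = FALSE]
notes <- notes[, colSums(!is.na(notes)) > 0, drop = FALSE]
notes <- notes[rowSums(!is.na(notes)) > 0, , drop = FALSE]
notes
```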

Credits

  • Many thanks to Jay Jacobs for the original verisr package. While it hasn’t received any updates in recent years, the project has been a tremendous help and starting point.

  • Thanks to the Verizon RISK Team and the community behind the VERIS Community Database.

  • Thanks to Hadley Wickham, the contributors and all maintainers of packages used in this project

Contributing and Issues

The project is licensed under GPL-2. Please feel free to fork, submit pull requests or open issues.