Title: | Helpful Functions from Oliver Wyman Actuarial Consulting |
---|---|
Description: | Makes difficult operations easy. Includes these types of functions: shorthand, type conversion, data wrangling, and workflow. Also includes some helpful data objects: NA strings, U.S. state list, color blind charting colors. Built and shared by Oliver Wyman Actuarial Consulting. Accepting proposed contributions through GitHub. |
Authors: | Oliver Wyman Actuarial Consulting [aut, cph], Bryce Chamberlain [aut, cre], Rajesh Sahasrabuddhe [ctb] |
Maintainer: | Bryce Chamberlain <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.5-12 |
Built: | 2024-11-01 11:16:48 UTC |
Source: | https://github.com/oliver-wyman-actuarial/easyr |
Opposite of %in%. Author: Bryce Chamberlain.
needle %ni% haystack
needle |
Vector to search for. |
haystack |
Vector to search in. |
Boolean vector/value of comparisons.
c(1,3,11) %ni% 1:10
Prints a vector as text you can copy and paste back into the code. Helpful for copying vectors into code for testing and validation. Author: Bryce Chamberlain.
astext(x)
x |
Vector to represent as text. |
Vector represented as a character.
astext( c( 1, 2, 4 ) )
astext( c( 'a', 'b', 'c' ) )
Use easyr date and number and conversion functions to automatically convert data to the most useful type available.
atype( x, auto_convert_dates = TRUE, allow_times = FALSE, check_numbers = TRUE, nazero = FALSE, check_logical = TRUE, isexcel = TRUE, stringsAsFactors = FALSE, nastrings = easyr::nastrings, exclude = NULL, use_n_sampled_rows = min(nrow(x), 10000) )
x |
Data to auto-type. |
auto_convert_dates |
Choose to convert dates. |
allow_times |
Choose if you want to get times. Only use this if your data has times, otherwise there is a small chance it will prevent proper date conversion. |
check_numbers |
Choose to convert numbers. |
nazero |
Convert NAs in numeric columns to 0. |
check_logical |
Choose to convert logical (TRUE/FALSE) values. |
isexcel |
By default, we assume this data may have come from excel. This is to assist in date conversion from excel integers. If you know it didn't and are having issues with data conversion, set this to FALSE. |
stringsAsFactors |
Convert strings/characters to factors to save compute time, RAM/memory, and storage space. |
nastrings |
Strings to consider NA. |
exclude |
Column name(s) to exclude. |
use_n_sampled_rows |
Used on large data sets. |
Author: Bryce Chamberlain.
Data frame with column types automatically converted.
# create some data in all-characters.
x = data.frame( char = c( 'abc', 'def' ), num = c( '1', '2' ), date = c( '1/1/2018', '2018-2-01' ), na = c( NA, NA ), bool = c( 'TRUE', 'FALSE' ), stringsAsFactors = FALSE )
# different atype options. Note how the output types change.
str( atype( x ) )
str( atype( x, exclude = 'date' ) )
str( atype( x, auto_convert_dates = FALSE ) )
str( atype( x, check_logical = FALSE ) )
Perform common operations before running a script. Includes clearing environment objects, disabling scientific notation, loading common packages, running fun/ or functions/ folders, and setting the working directory to the location of the current file.
begin( wd = NULL, load = c("magrittr", "dplyr"), keep = NULL, scipen = FALSE, verbose = TRUE, repos = "http://cran.us.r-project.org", runpath = NULL )
wd |
Path to set as working directory. If blank, the location of the current file open in RStudio will be used if available. If FALSE, the working directory will not be changed. |
load |
Packages to load. If not available, they'll be installed. |
keep |
Environment objects to keep. If blank, all objects will be removed from the environment. |
scipen |
Use scientific notation in output? |
verbose |
Print information about what the function is doing? |
repos |
choose the URL to install from. |
runpath |
Folder or file to run, if specified. |
begin()
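A hedged sketch of a fuller call than the plain begin() above; 'inputs' is a hypothetical object name used only for illustration, and wd = FALSE leaves the working directory unchanged per the argument description.
# keep one object, leave the working directory alone, and run quietly.
begin( wd = FALSE, keep = 'inputs', verbose = FALSE )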
Bins a numerical column according to another numerical column's volume. For example if I want to bin a column "Age" (of people) into 10 deciles according to "CountofPeople" then I will get Age breakpoints returned by my function such that there is 10% of CountofPeople in each bin. This function handles NAs as their own separate bin, and handles any special values you want to separate out. Author: Scott Sobel. Tech Review: Bryce Chamberlain.
binbyvol(df, groupby, vol, numbins)
df |
(Data Frame) Your data. |
groupby |
(Character) Name of the column you'll create cuts in. Must be the character name of a numeric column. |
vol |
(Character) Name of the column for which each cut will have an equal percentage of volume. |
numbins |
Number of bins to use. |
Age breakpoints returned by my function such that there is 10% of CountofPeople in each bin.
# bin Sepal.Width according to Sepal.Length.
iris$bin <- binbyvol(iris, 'Sepal.Width', 'Sepal.Length', 5)
# check the binning success.
aggregate( Sepal.Length ~ bin, data = iris, sum )
Matches factor levels before binding rows. If factors have 0 levels it will change the column to character to avoid errors. Author: Bryce Chamberlain.
bindf(..., sort.levels = TRUE)
... |
Data to be bound. |
sort.levels |
Sort the factor levels after combining them. |
Bound data, with any factors modified to contain all levels in the combined data.
# create data where factors have different levels.
df1 = data.frame( factor1 = c( 'a', 'b', 'c' ), factor2 = c( 'high', 'medium', 'low' ), factor.join = c( '0349038u093843', '304359867893753', '3409783509735' ), numeric = c( 1, 2, 3 ), logical = c( TRUE, TRUE, TRUE ) )
df2 = data.frame( factor1 = c( 'd', 'e', 'f' ), factor2 = c( 'low', 'medium', 'high' ), factor.join = c( '32532532536', '304359867893753', '32534745876' ), numeric = c( 4, 5, 6 ), logical = c( FALSE, FALSE, FALSE ) )
# bindf preserves factors but combines levels.
# factor-friendly functions default to ordered levels.
str( df1 )
str( bindf( df1, df2 ) )
Utility function for capturing warnings.
cache.capture_warning(w)
w |
Captured warning passed by withCallingHandlers. |
# this will only have an effect if a current cache exists.
## Not run: 
if(!cache.ok(1)) withCallingHandlers({
  x = mtcars # base-R dataset.
  warning('warning 2-1') # this is the first warning we need to capture.
  warning('warning 2-2') # this is the second warning we need to capture.
  save.cache(x) # we'll capture it inside save.cache.
}, warning = cache.capture_warning)
## End(Not run)
Set cache info so easyr can manage the cache.
cache.init( caches, at.path, verbose = TRUE, save.only = FALSE, skip.missing = TRUE, n_processes = 2 )
caches |
List of lists with properties name, depends.on. See example. |
at.path |
Where to save the cache. If NULL, a cache/ folder will be created in the current working directory. |
verbose |
Print via cat() information about cache operations. |
save.only |
Choose not to load the cache. Use this if you need to check cache validity in multiple spots but only want to load at the last check. |
skip.missing |
Passed to hashfiles, choose if an error occurs if a depends.on file isn't found. |
n_processes |
Passed to qs to determine how many cores/workers to use when reading/saving data. |
# initialize a cache with 1 cache which depends on files in the current working directory.
# this will create a cache folder in your current working directory.
# then, you call functions to check and build the cache.
## Not run: 
folder = system.file('extdata', package = 'easyr')
cache.init(
  # Initial file read (raw except for renaming).
  caches = list(
    list( name = 'prep-files', depends.on = paste0(folder, '/script.R') )
  ),
  at.path = paste0(tempdir(), '/cache')
)
## End(Not run)
Check a cache and if necessary clear it to trigger a re-cache.
cache.ok(cache.num, do.load = TRUE)
cache.num |
The index/number for the cache we are checking in the cache.info list. |
do.load |
Load the cache if it is found. |
Boolean indicating if the cache is acceptable. FALSE indicates the cache doesn't exist or is invalid so code should be run again.
# check the first cache to see if it exists and dependent files haven't changed.
# if this is TRUE, code in brackets will get skipped and the cache will be loaded instead.
# set do.load = FALSE if you have multiple files that build a cache,
# to prevent multiple cache loads.
# output will be printed to the console to tell you if the cache was loaded or re-built.
## Not run: 
if( ! cache.ok(1) ){
  # do stuff
  # if this is the final file for this cache,
  # end with save.cache to save passed objects as a cache.
  save.cache(iris)
}
## End(Not run)
Save Cache (Alternate)
Saves the arguments to a cache file, using the cache.num last checked with cache.ok. This function provides an alternative syntax more aligned with other functions that start with "cache.".
cache.save(...)
... |
Objects to save. |
# check the first cache to see if it exists and dependent files haven't changed.
# if this check is TRUE, code in brackets will get skipped and the cache will be loaded instead.
# set do.load = FALSE if you have multiple files that build a cache,
# to prevent multiple cache loads.
# output will be printed to the console to tell you if the cache was loaded or re-built.
## Not run: 
if( ! cache.ok(1) ){
  # do stuff
  # if this is the final file for this cache,
  # end with cache.save to save passed objects as a cache.
  cache.save(iris)
}
## End(Not run)
Color palette that is effective for color-blind clients.
cblind
Named vector of hex colors.
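Since cblind is a named vector of hex colors, it can be passed directly to color arguments in base graphics or ggplot2. A minimal sketch using base graphics:
# preview the color-blind-friendly palette shipped with easyr.
barplot( rep( 1, length( easyr::cblind ) ), col = easyr::cblind, names.arg = names( easyr::cblind ) )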
Shorthand function for paste. Author: Bryce Chamberlain.
cc(..., sep = "")
... |
Arguments to be passed to paste0. Typically a list of vectors or values to be concatenated. |
sep |
(Optional) Separator between concatenated items. |
Vector of pasted/concatenated values.
cc( 1, 2, 4 )
x = data.frame( c1 = c( 1, 2, 4 ), c2 = c( 3, 5, 7 ) )
cc( x$c1, x$c2 )
cc( x$c1, x$c2, sep = '-' )
Convert all character columns in a data frame to factors. Author: Bryce Chamberlain.
char2fac(x, sortlevels = FALSE, na_level = "(Missing)")
x |
Data frame to modify. |
sortlevels |
Choose whether to sort levels. This is the default R behavior and is therefore likely faster, but it may change the order of the data and this can be problematic so the default is FALSE. |
na_level |
some functions don't like factors to have NAs so we replace NAs with this value for factors only. Set NULL to skip. |
Data frame with converted factors.
char2fac( iris )
Checks a vector or value to see if it is a number formatted as a character. Useful for checking columns formatted with $ or commas, etc. Author: Bryce Chamberlain. Tech review: Dominic Dillingham.
charnum(x, na_strings = easyr::nastrings, run_unique = TRUE, check_date = TRUE)
x |
Vector to check. |
na_strings |
Strings to consider NA. |
run_unique |
Convert to unique variables before checking. In some cases, this can make it take longer than necessary. In most, it will make it faster. |
check_date |
Check for a date, in which case it isn't a number. If you have already checked a date and know it isn't, set this to FALSE to run faster. |
True/false value indicating if the vector is a number formatted as a character. Helpful for checking before calling easyr::tonum().
charnum( c( '123', '$50.02', '30%', '(300.01)', '-10', '1 230.4', NA, '-', '', "3.7999999999999999E-2" ))
charnum( c( '123', 'abc', '30%', NA) )
# returns FALSE since this can be converted to a date:
charnum( c( '20180101' ))
Check actual versus expected values and get helpful metrics back. Author: Bryce Chamberlain. Tech review: Lindsay Smeltzer.
checkeq( expected, actual, desc = "", acceptable_pct_diff = 0.00000001, digits = 2 )
expected |
The expected value of the metric. |
actual |
The actual value of the metric. |
desc |
(Optional) Description of the metric being checked. |
acceptable_pct_diff |
(Optional) Acceptable percentage difference when checking values. Checked as an absolute value. |
digits |
(Optional) Digits to round to. Without rounding you get errors from floating values. Set to NA to avoid rounding. |
Message (via cat) indicating success or errors out in case of failure.
checkeq(expected=100,actual=100,desc='A Match')
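A sketch of a near-match check, assuming acceptable_pct_diff is compared against the absolute percentage difference as described above:
# passes as long as the difference is within the acceptable threshold.
checkeq( expected = 100, actual = 100.4, desc = 'Near Match', acceptable_pct_diff = 1 )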
Clears all caches or the cache related to the passed cache info list.
clear.cache(cache = NULL)
cache |
The cache list to clear. |
FALSE if a cache info list item is passed in order to assist other functions in returning this value, otherwise NULL.
# this will only have an effect if a current cache exists.
## Not run: 
clear.cache()
## End(Not run)
Coalesce function that matches and updates factor levels appropriately. Checks each argument vector starting with the first until a non-NA value is found. Author: Bryce Chamberlain.
coalf(...)
... |
Source vectors. |
Vector of values.
x <- sample(c(1:5, NA, NA, NA))
coalf(x, 0L)
Concatenate arguments and run them as a command. Shorthand for eval( parse( text = paste0( ... ) ) ). Consider also using base::get() which can be used to get an object from a string, but only if it already exists. Author: Bryce Chamberlain.
crun(...)
... |
Character(s) to be concatenated and run as a command. |
crun( 'print(', '"hello world!"', ')')
crun('T', 'RUE')
Date difference (or difference in days).
ddiff(x, y, unit = "day", do.date.convert = TRUE, do.numeric = TRUE)
x |
Vector of starting dates or items that can be converted to dates by todate. |
y |
Vector of ending dates or items that can be converted to dates by todate. |
unit |
Character indicating what to use as the unit of difference. Values like d, y, m or day, year, month will work. Takes just the first letter in lower-case to determine unit. |
do.date.convert |
Convert to dates before running the difference. If you know your columns are already dates, setting to FALSE will make your code run faster. |
do.numeric |
Convert the output to a number instead of a date difference object. |
Vector of differences.
ddiff( lubridate::mdy( '1/1/2018' ), lubridate::mdy( '3/4/2018' ) )
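The same pair of dates with a different unit; per the unit argument above, only the first lower-case letter matters, so 'm', 'month', and 'months' should behave the same (a sketch, output not shown).
# difference in months instead of days.
ddiff( lubridate::mdy( '1/1/2018' ), lubridate::mdy( '3/4/2018' ), unit = 'month' )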
Get information about a Data Frame or Data Table. Use getinfo to explore a single column instead. If you like, use the ecopy function or argument to copy the output to the clipboard so that it can be pasted into Excel. Otherwise it returns a data frame. Author: Scott Sobel. Tech Review & Modifications: Bryce Chamberlain.
dict( x, topn = 5, botn = 5, na.strings = easyr::nastrings, do.atype = TRUE, ecopy = FALSE )
x |
Data Frame or Data Table. |
topn |
Number of top values to print. |
botn |
Number of bottom values to print. |
na.strings |
Strings to consider NA. |
do.atype |
Auto-determine variable types. If your data already has types set, skip this to speed up the code. |
ecopy |
Copy the output to the clipboard so that it can be pasted into Excel. |
dict(iris)
Pulls all rows with duplicates in a column, not just the duplicate row. Author: Bryce Chamberlain.
drows(x, c, na = FALSE)
x |
Data frame. |
c |
Column as vector or string. |
na |
Consider multiple NAs as duplicates? |
Rows from the data frame in which the column is duplicated.
ddt = bindf( cars, utils::head( cars, 10 ) )
drows( ddt, 'speed' )
Copies a data.frame or anything that can be converted into a data.frame. After running this, you can use ctrl+v or Edit > Paste to paste it to another program, typically Excel. A simple use case would be ecopy(names(df)) to copy the names of a data frame to the clipboard to paste to Excel or Outlook. Author: Scott Sobel. Tech Review: Bryce Chamberlain.
ecopy( x, showrowcolnames = c("cols", "rows", "both", "none"), show = FALSE, buffer = 1024 )
x |
Object you'd like to copy to the clipboard. |
showrowcolnames |
(Optional) Show row and column names. Choose 'none', 'cols', 'rows', or 'both'. |
show |
(Optional) Set to 'show' (or TRUE) if you want to also print the object to the console. |
buffer |
(Optional) Set clipboard buffer size. |
ecopy( iris, showrowcolnames = "cols", show = 'show' )
ecopy(iris)
Vectorized flexible equality comparison which considers NAs as a value. Returns TRUE if both values are NA, and FALSE when only one is NA. The standard == comparison returns NA in both of these cases and sometimes this is interpreted unexpectedly. Author: Bryce Chamberlain. Tech Review: Maria Gonzalez.
eq(x, y, do.nanull.equal = TRUE)
x |
First vector/value for comparison. |
y |
Second vector/value for comparison. |
do.nanull.equal |
Return TRUE if both inputs are NA or NULL (tested via easyr::nanull). |
Boolean vector/value of comparisons.
c(NA,'NA',1,2,'c') == c(NA,NA,1,2,'a') # regular equality check.
eq(c(NA,'NA',1,2,'c'),c(NA,NA,1,2,'a')) # check with eq.
Convert all factor columns in a data frame to characters. Author: Bryce Chamberlain.
fac2char(x)
x |
Data frame to modify. |
Data frame with converted characters.
fac2char( iris )
Matches factor levels before full join via merge. Author: Bryce Chamberlain.
fjoinf( data.left, data.right, by, sort.levels = TRUE, restrict.levels = FALSE, na_level = "(Missing)" )
data.left |
Left data. All of this data will be preserved in the join (may result in duplication). |
data.right |
Right data. All of this data will be preserved in the join (may also result in duplication). |
by |
Columns to join on. |
sort.levels |
Sort the factor levels after combining them. |
restrict.levels |
Often the joined data won't use all the levels in both datasets. Set to TRUE to remove factor levels that aren't in the joined data. |
na_level |
some functions don't like factors to have NAs so we replace NAs with this value for factors only. Set NULL to skip. |
Joined data, with any factors modified to contain all levels in the joined data.
df1 = data.frame( factor1 = c( 'a', 'b', 'c' ), factor2 = c( 'high', 'medium', 'low' ), factor.join = c( '0349038u093843', '304359867893753', '3409783509735' ), numeric = c( 1, 2, 3 ), logical = c( TRUE, TRUE, TRUE ) )
df2 = data.frame( factor1 = c( 'd', 'e', 'f' ), factor2 = c( 'low', 'medium', 'high' ), factor.join = c( '32532532536', '304359867893753', '32534745876' ), numeric = c( 4, 5, 6 ), logical = c( FALSE, FALSE, FALSE ) )
fjoinf( df1, df2, by = 'factor.join' )
Get information about data files in a folder path. Use dict() on a single data frame or getinfo() to explore a single column. Author: Bryce Chamberlain.
fldict( folder = NULL, file.list = NULL, pattern = "^[^~]+[.](xls[xmb]?|csv|rds|xml)", ignore.case = TRUE, recursive = TRUE, verbose = FALSE, ... )
folder |
File path of the folder to create a dictionary for. Pass either this or file.list. file.list will override this argument. |
file.list |
List of files to create a combined dictionary for. Pass either this or folder. This will override folder. |
pattern |
Pattern to match files in the folder. By default we use a pattern that matches read.any-compatible data files and skips temporary Office files. Passed to list.files. |
ignore.case |
Ignore case when checking pattern. Passed to list.files. |
recursive |
Check files recursively. Passed to list.files. |
verbose |
Print helpful information. |
... |
Other arguments to read.any for reading in files. Consider using a first_column_name vector, etc. |
List with the properties:
s |
Summary data of each dataset. |
l |
Line data with a row for each column in each dataset. |
folder = system.file('extdata', package = 'easyr')
fl = fldict(folder)
names(fl)
fl$sheets
fl$columns
Flexible number formatter for easier formatting from numbers and dates into characters for display.
fmat( x = NULL, type = c("auto", ",", "$", "%", ".", "mdy", "ymd", "date", "dollar", "dollars", "count", "percentage", "decimal"), do.return = c("formatted", "highcharter"), digits = NULL, with.unit = FALSE, do.date.sep = "/", do.remove.spaces = FALSE, digits.cutoff = NULL )
x |
Vector of values to convert. If retu |
type |
Type of format to return. If do.return == 'highcharter' this is not required. |
do.return |
Information to return. "formatted" returns a vector of formatted values. |
digits |
Number of digits for rounding. If left blank, the function will guess at the best digits. |
with.unit |
For large numbers, choose to add a suffix for fewer characters, like M for million, etc. |
do.date.sep |
Separator for date formatting. |
do.remove.spaces |
Remove extra spaces in return. |
digits.cutoff |
Amount at which to show 0 digits. Allows for flexibility of rounding. |
Information requested via do.return.
fmat( 1000, 'dollar', digits = 2 )
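A few more calls exercising other documented type options; the exact formatted strings aren't shown here since they depend on fmat's defaults (digit guessing, etc.).
# percentage and comma-separated count formats.
fmat( 0.1234, '%' )
fmat( 1234567, ',' )
# date formatting with a custom separator.
fmat( lubridate::mdy( '3/4/2018' ), 'mdy', do.date.sep = '-' )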
Takes bucket names of binned values such as [1e3,2e3) or [0.1234567, 0.2) and formats the values nicely into values such as 1,000-2,000 or 0.12-0.20. Author: Scott Sobel. Tech Review: Bryce Chamberlain.
getbetterint(int)
int |
Vector of character bucket names to transform. |
Vector of transformed values.
iris$bin <- binbyvol( iris, 'Sepal.Width', 'Sepal.Length', 5 )
getbetterint( iris$bin )
Get information about a Column in a Data Frame or Data Table. Use getdatadict to explore all columns in a dataset instead. Author: Scott Sobel. Tech Review: Bryce Chamberlain.
getinfo( df, colname, topn = 5, botn = 5, graph = TRUE, ordered = TRUE, display = TRUE, cutoff = 20, main = NULL, cex = 0.9, xcex = 0.9, bins = 50, col = "light blue" )
df |
Data Frame or Data Table. |
colname |
(Character) Name of the column to get information about. |
topn |
(Optional) Number of top values to print. |
botn |
(Optional) Number of bottom values to print. |
graph |
(Boolean Optional) Output a chart of the column. |
ordered |
(Optional) |
display |
(Optional) |
cutoff |
(Optional) |
main |
(Optional) |
cex |
(Optional) |
xcex |
(Optional) |
bins |
(Optional) |
col |
(Optional) |
Only if display = FALSE, returns information about the column. Otherwise information comes through the graphing pane and the console (via cat/print).
getinfo(iris,'Sepal.Width')
getinfo(iris,'Species')
Get the golden ratio. Author: Bryce Chamberlain. Tech Review: Maria Gonzalez.
gr()
The golden ratio: (1+sqrt(5)) / 2
gr()
Get a hash value representing a list of files. Useful for determining if files have changed in order to reset dependent caches.
hashfiles( x, skip.missing = FALSE, full.hash = FALSE, verbose = FALSE, skiptemp = TRUE )
x |
Input which specifies which files to hash. This can be a vector mix of paths and files. |
skip.missing |
Skip missing files. Default is to throw an error if a file isn't found. |
full.hash |
By default we just hash the file info (name, size, created/modified time). Set this to TRUE to read the file and hash the contents. |
verbose |
Print helpful messages from code. |
skiptemp |
Skip temporary MS Office files like "~$Simd Loss Eval 2018-06-30.xlsx" |
String representing hash of files.
folder = system.file('extdata', package = 'easyr')
hashfiles(folder)
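By default only file info is hashed; a content-based hash (slower but stricter) is a one-argument change, reusing the folder from the example above.
# hash file contents instead of just file info.
hashfiles( folder, full.hash = TRUE )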
Identify the row with headers in a data frame. It should NOT be used directly (that's why it isn't exported), but will be called by function [read.any] as necessary, with the applicable defaults set by that function.
headers_row( x, headers_on_row = NA, first_column_name = NA, field_name_map = NA )
x |
Data frame to work with. |
headers_on_row |
The specific row with headers on it. |
first_column_name |
A known column(s) that can be used to find the header row. This is more flexible, but only used if headers_on_row is not available. If multiple are possible, use a vector argument here. |
field_name_map |
field_name_map from read.any. |
List with headers_already_column_names (TRUE/FALSE) and headers_on_row (1-indexed row number, to match standard R indexing).
Matches factor levels before inner join via merge. Author: Bryce Chamberlain.
ijoinf( data.left, data.right, by, sort.levels = TRUE, restrict.levels = FALSE, na_level = "(Missing)" )
data.left |
Left data. Only rows that match the join will be included (may still result in duplication). |
data.right |
Right data. Only rows that match the join will be included (may also result in duplication). |
by |
Columns to join on. |
sort.levels |
Sort the factor levels after combining them. |
restrict.levels |
Often the joined data won't use all the levels in both datasets. Set to TRUE to remove factor levels that aren't in the joined data. |
na_level |
some functions don't like factors to have NAs so we replace NAs with this value for factors only. Set NULL to skip. |
Joined data, with any factors modified to contain all levels in the joined data.
df1 = data.frame( factor1 = c( 'a', 'b', 'c' ), factor2 = c( 'high', 'medium', 'low' ), factor.join = c( '0349038u093843', '304359867893753', '3409783509735' ), numeric = c( 1, 2, 3 ), logical = c( TRUE, TRUE, TRUE ) )
df2 = data.frame( factor1 = c( 'd', 'e', 'f' ), factor2 = c( 'low', 'medium', 'high' ), factor.join = c( '32532532536', '304359867893753', '32534745876' ), numeric = c( 4, 5, 6 ), logical = c( FALSE, FALSE, FALSE ) )
ijoinf( df1, df2, by = 'factor.join' )
Shorthand for is.character
ischar(x)
x |
Value to check. |
logical indicator
ischar( 'a character' )
ischar(1)
Shorthand for lubridate::is.Date
isdate(x)
x |
Value to check. |
logical indicator
isdate( lubridate::mdy( '10/1/2014' ) )
isdate(1)
Shorthand for is.factor
isfac(x)
x |
Value to check. |
logical indicator
isfac( factor( c( 'a', 'b', 'c' ) ) )
isfac(1)
Shorthand for is.numeric
isnum(x)
x |
Value to check. |
logical indicator
isnum(1)
isnum( factor( c( 'a', 'b', 'c' ) ) )
Facilitates checking for missing values which may cause errors later in code. NULL values can cause errors on is.na checks, and is.na can cause warnings if it is inside if() and is passed multiple values. This function makes it easier to check for missing values before trying to operate on a variable. It will NOT check for strings like "" or "NA". Only NULL and NA values will return TRUE. Author: Bryce Chamberlain. Tech Review: Maria Gonzalez.
isval(x, na_strings = easyr::nastrings, do.test.each = FALSE)
x |
Object to check. In the case of a data frame or vector, it will check the first (non-NULL) value. |
na_strings |
(Optional) Set the strings you want to consider NA. These will be applied after stringr::str_trim on x. |
do.test.each |
Return a vector of results to check each element instead of checking the entire object. |
True/false indicating if the argument is NA, NULL, or an empty/NA string/vector. For speed, only the first value is checked.
isval( NULL )
isval( NA )
isval( c( NA , NULL ) )
isval( c( 1, 2, 3 ) )
isval( c( NA, 2, 3 ) )
isval( c( 1, 2, NA ) ) # only the first value is checked, so this will come back FALSE.
isval( c( NULL, 2, 3 ) ) # NULL values get skipped in a vector.
isval( data.frame() )
isval( dplyr::group_by( dplyr::select( cars, speed, dist ), speed ) ) # test a tibble.
isval( "#VALUE!" ) # test an excel error code.
Replace a column's values with matches in a different dataset. Author: Bryce Chamberlain.
jrepl( x, y, by, replace.cols, na.only = FALSE, only.rows = NULL, verbose = FALSE, viewalldups = FALSE, warn = FALSE )
x |
Main dataset which will have new values. This data set will be returned with new values. |
y |
Supporting dataset which has the id and new values. |
by |
Vector of join column names. A character vector if the names match. A named character vector if they don't. |
replace.cols |
Vector of replacement column names, similar format as by. |
na.only |
Only replace values that are NA. |
only.rows |
Select rows to be affected. Default checks all rows. |
verbose |
Print via cat information about the replacement. |
viewalldups |
Set to TRUE to see all duplicates |
warn |
Set to TRUE to see warnings. |
x with new values.
df1 = utils::head( sleep )
group.reassign = data.frame( id.num = factor( c( 1, 3, 4 ) ), group.replace = factor( c( 99, 99, 99 ) ) )
jrepl( x = df1, y = group.reassign, by = c( 'ID' = 'id.num' ), replace.cols = c( 'group' = 'group.replace' ) )
# doesn't affect since there are no NAs in group.
jrepl( x = df1, y = group.reassign, by = c( 'ID' = 'id.num' ), replace.cols = c( 'group' = 'group.replace' ), na.only = TRUE )
Behaves like Excel's LEFT, RIGHT, and MID functions. Author: Dave. Tech review: Bryce Chamberlain.
left(string, char)
string |
String to process. |
char |
Number of characters. |
left( "leftmidright", 4 )
Check if a column can be converted to a date. Helpful for checking a column before actually converting it. Author: Bryce Chamberlain. Tech review: Dominic Dillingham.
likedate( x, na_strings = easyr::nastrings, run_unique = TRUE, aggressive.extraction = TRUE )
x |
Value or vector to check. |
na_strings |
Vector of characters to consider NA. Like Date will treat these values like NA. |
run_unique |
Convert to unique variables before checking. In some cases, this can make it take longer than necessary. In most, it will make it faster. |
aggressive.extraction |
todate will take dates inside long strings (like filenames) and convert them to dates. This seems to be the preferred outcome, so we leave it as default (TRUE). However, if you want to avoid this you can do so via this option (FALSE). |
Boolean indicating if the entire vector can be converted to a date.
x <- c('20171124','2017/12/24',NA,'12/24/2017','March 3rd, 2015','Mar 3, 2016')
likedate(x)
likedate(c(123,456,NA))
if(likedate(x)) t <- todate(x)
likedate(lubridate::mdy('1-1-2014'))
likedate( '3312019' )
likedate( '2019.1.3' )
Matches factor levels before left join via merge. Author: Bryce Chamberlain.
ljoinf( data.left, data.right, by, sort.levels = TRUE, restrict.levels = FALSE, na_level = "(Missing)" )
data.left |
Left data. All of this data will be preserved in the join (may still result in duplication). |
data.right |
Right data. Only rows that match the join will be included (may also result in duplication). |
by |
Columns to join on. |
sort.levels |
Sort the factor levels after combining them. |
restrict.levels |
Often the joined data won't use all the levels in both datasets. Set to TRUE to remove factor levels that aren't in the joined data. |
na_level |
some functions don't like factors to have NAs so we replace NAs with this value for factors only. Set NULL to skip. |
Joined data, with any factors modified to contain all levels in the joined data.
df1 = data.frame( factor1 = c( 'a', 'b', 'c' ), factor2 = c( 'high', 'medium', 'low' ), factor.join = c( '0349038u093843', '304359867893753', '3409783509735' ), numeric = c( 1, 2, 3 ), logical = c( TRUE, TRUE, TRUE ) )
df2 = data.frame( factor1 = c( 'd', 'e', 'f' ), factor2 = c( 'low', 'medium', 'high' ), factor.join = c( '32532532536', '304359867893753', '32534745876' ), numeric = c( 4, 5, 6 ), logical = c( FALSE, FALSE, FALSE ) )
ljoinf( df1, df2, by = 'factor.join' )
Modifies two datasets so matching factor columns have the same levels. Typically this is used prior to joining or bind_rows in the easyr functions bindf, ijoinf, ljoinf, and fjoinf.
match.factors(df1, df2, by = NA, sort.levels = TRUE)
df1 |
First data set. |
df2 |
Second data set. |
by |
Columns to join on, comes from the function using match.factors (ljoinf, fjoinf, ijoinf). |
sort.levels |
Sort the factor levels after combining them. |
List of the same data but with factors modified as applicable. All factors are checked if no 'by' argument is passed. Otherwise only the 'by' argument is checked.
df1 = data.frame( factor1 = c( 'a', 'b', 'c' ), factor2 = c( 'high', 'medium', 'low' ), factor.join = c( '0349038u093843', '304359867893753', '3409783509735' ), numeric = c( 1, 2, 3 ), logical = c( TRUE, TRUE, TRUE ) )
df2 = data.frame( factor1 = c( 'd', 'e', 'f' ), factor2 = c( 'low', 'medium', 'high' ), factor.join = c( '32532532536', '304359867893753', '32534745876' ), numeric = c( 4, 5, 6 ), logical = c( FALSE, FALSE, FALSE ) )
t = match.factors( df1, df2 )
levels( df1$factor1 )
levels( t[[1]]$factor1 )
levels( t[[2]]$factor1 )
Date Difference in Months
mdiff(x, y, do.date.convert = TRUE, do.numeric = TRUE)
x |
Vector of starting dates or items that can be converted to dates by todate. |
y |
Vector of ending dates or items that can be converted to dates by todate. |
do.date.convert |
Convert to dates before running the difference. If you know your columns are already dates, setting to FALSE will make your code run faster. |
do.numeric |
Convert the output to a number instead of a date difference object. |
Vector of differences.
mdiff( lubridate::mdy( '1/1/2018' ), lubridate::mdy( '3/4/2018' ) )
Behaves like Excel's LEFT, RIGHT, and MID functions. Author: Bryce Chamberlain.
mid(string, start, nchars)
string |
String to process. |
start |
Index (1-index) to start at. |
nchars |
Number of characters to read in from start. |
mid( "leftmidright", 5, 3 )
Shorthand for is.na
na(x)
x |
Value to check. |
logical indicator
na(NA)
na(1)
Get column names that match a pattern. Author: Scott Sobel. Tech review: Bryce Chamberlain.
namesx(df, char, fixed = TRUE, ignore.case = TRUE)
df |
Object with names you'd like to search. |
char |
Regex character to match to columns. |
fixed |
Match as a string, not a regular expression. |
ignore.case |
Ignore case in matches. |
Vector of matched names.
namesx( iris,'len' )
namesx( iris,'Len' )
Shorthand for is.nan
nan(x)
x |
Value to check. |
logical indicator
nan( NaN )
nan(1)
Facilitates checking for missing values which may cause errors later in code. NULL values can cause errors on is.na checks, and is.na can cause warnings if it is inside if() and is passed multiple values. This function makes it easier to check for missing values before trying to operate on a variable. It will NOT check for strings like "" or "NA". Only NULL and NA values will return TRUE. Author: Bryce Chamberlain. Tech Review: Maria Gonzalez.
nanull(x, na_strings = easyr::nastrings, do.test.each = FALSE)
x |
Vector to check. In the case of a data frame or vector, it will check the first (non-NULL) value. |
na_strings |
(Optional) Set the strings you want to consider NA. These will be applied after stringr::str_trim on x. |
do.test.each |
Return a vector of results to check each element instead of checking the entire object. |
True/false indicating if the argument is NA, NULL, or an empty/NA string/vector. For speed, only the first value is checked.
nanull( NULL )
nanull( NA )
nanull( c( NA , NULL ) )
nanull( c( 1, 2, 3 ) )
nanull( c( NA, 2, 3 ) )
nanull( c( 1, 2, NA ) ) # only the first value is checked, so this will come back FALSE.
nanull( c( NULL, 2, 3 ) ) # NULL values get skipped in a vector.
nanull( data.frame() )
nanull( dplyr::group_by( dplyr::select( cars, speed, dist ), speed ) ) # test a tibble.
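With do.test.each = TRUE the check is element-wise instead of a single value for the whole object; a short sketch based on the argument description above:
# returns one TRUE/FALSE per element.
nanull( c( 1, NA, 3 ), do.test.each = TRUE )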
A list of strings to consider NA. Includes blank string, "NA", excel errors, etc. Used throughout easyr for checking NA.
nastrings
A vector of values.
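A quick way to inspect the strings easyr treats as NA, and to extend them for a one-off conversion (the extra value here is just an illustration):
# look at the built-in NA strings.
head( easyr::nastrings )
# add a custom NA string for a single call, e.g. when checking character numbers.
charnum( c( '1', 'not applicable' ), na_strings = c( easyr::nastrings, 'not applicable' ) )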
Shorthand for is.null
null(x)
x |
Value to check. |
logical indicator
null( NULL )
null(1)
Adds leading zeros to a numeric vector to make each value a specific length. For values shorter than length passed, leading zeros are removed. Author: Scott Sobel. Tech Review: Bryce Chamberlain.
pad0(x, len)
x |
Vector. |
len |
Number of characters you want in each value. |
Character vector with padded values.
pad0( c(123,00123,5), len = 5 )
pad0( c(123,00123,5), len = 2 )
pad0( '1234', 5 )
Date Difference in Quarters
qdiff(x, y, do.date.convert = TRUE, do.numeric = TRUE)
x |
Vector of starting dates or items that can be converted to dates by todate. |
y |
Vector of ending dates or items that can be converted to dates by todate. |
do.date.convert |
Convert to dates before running the difference. If you know your columns are already dates, setting to FALSE will make your code run faster. |
do.numeric |
Convert the output to a number instead of a date difference object. |
Vector of differences.
qdiff( lubridate::mdy( '1/1/2018' ), lubridate::mdy( '3/4/2018' ) )
Code to fix column names, since this has to be done up to twice while reading in files. It should NOT be used directly (that's why it isn't exported), but will be called by function [read.any] as necessary, with the applicable defaults set by that function.
rany_fixColNames(col_names, fix.dup.column.names, nastrings)
col_names |
Vector/value of column names/name. |
fix.dup.column.names |
Adds 'DUPLICATE #' to duplicated column names to avoid errors with duplicate names. |
nastrings |
Characters/strings to read as NA. |
Fixed names.
Flexible read function to handle many types of files. Currently handles CSV, TSV, DBF, RDS, XLS (incl. when formatted as HTML), and XLSX. Also handles common issues like strings being read in as factors (strings are NOT read in as factors by this function, you'd need to convert them later). Author: Bryce Chamberlain. Tech Review: Dominic Dillingham.
read.any( filename = NA, folder = NA, sheet = 1, file_type = "", first_column_name = NA, header = is.null(widths), headers_on_row = NA, nrows = -1L, row.names.column = NA, row.names.remove = TRUE, make.names = FALSE, field_name_map = NA, require_columns = NA, all_chars = FALSE, auto_convert_dates = TRUE, allow_times = FALSE, check_numbers = TRUE, nazero = FALSE, check_logical = TRUE, stringsAsFactors = FALSE, na_strings = easyr::nastrings, na_level = "(Missing)", ignore_rows_with_na_at = NA, drop.na.cols = TRUE, drop.na.rows = TRUE, fix.dup.column.names = TRUE, do.trim.sheetname = TRUE, x = NULL, isexcel = FALSE, encoding = "unknown", verbose = TRUE, widths = NULL, col.names = NULL )
filename |
File path and name for the file to be read in. |
folder |
Folder path to look for the file in. |
sheet |
The sheet to read in. |
file_type |
Specify the file type (CSV, TSV, DBF, FWF). If not provided, R will use the file extension to determine the file type. Useful when the file extension doesn't indicate the file type, like .rpt, etc. |
first_column_name |
Define headers location by providing the name of the left-most column. Alternatively, you can choose the row via the [headers_on_row] argument. |
header |
Choose if your file contains headers. |
headers_on_row |
Choose a specific row number to use as headers. Use this when you want to tell read.any exactly where the headers are. |
nrows |
Number of rows to read. Leave blank/NA to read all rows. This only speeds up file reads (CSV, XLSX, etc.), not compressed data that must be read all at once. This is applied BEFORE headers_on_row or first_column_name removes top rows, so it should be greater than those values if headers aren't in the first row. |
row.names.column |
Specify the column (by character name) to use for row names. This drops the column and lets rows be referenced directly by this id. Values must be unique. |
row.names.remove |
If you move a column to row names, it is removed from the data by default. If you'd like to keep it, set this to FALSE. |
make.names |
Apply make.names function to make column names R-friendly (replaces non-characters with ., starting numbers with x, etc.) |
field_name_map |
Rename fields for consistency. Provide as a named vector where the names are the file's names and the vector values are the output names desired. See examples for how to create this input. |
require_columns |
List of required columns to check for. Calls stop() with helpful message if any aren't found. |
all_chars |
Keep all column types as characters. This makes using bind_rows easier; you can then use atype() later to set types. |
auto_convert_dates |
Identify date fields and automatically convert them to dates |
allow_times |
Times are not allowed when reading data in, to facilitate easy binding. If you need times though, set this to TRUE. |
check_numbers |
Identify numbers formatted as characters and convert them accordingly. |
nazero |
Convert NAs in numeric columns to 0. |
check_logical |
Identify logical columns formatted as characters (Yes/No, etc.) or numbers (0,1) and convert them accordingly. |
stringsAsFactors |
Convert characters to factors to increase processing speed and reduce file size. |
na_strings |
Strings to treat like NA. By default we use the easyr NA strings. |
na_level |
dplyr doesn't like factors to have NAs so we replace NAs with this value for factors only. Set NULL to skip. |
ignore_rows_with_na_at |
Vector or value, numeric or character, identifying column(s) that require a value. read.any will remove these rows after colname swaps and read, before type conversion. Especially helpful for removing things like page numbers at the bottom of an excel report that break type discovery. Suggest using the claim number column here. |
drop.na.cols |
Drop columns with only NA values. |
drop.na.rows |
Drop rows with only NA values. |
fix.dup.column.names |
Adds 'DUPLICATE #' to duplicated column names to avoid issues with multiple columns having the same name. |
do.trim.sheetname |
read.any will trim sheet names to get better matches. This will cause an error if the actual sheet name has spaces on the left or right side. Disable this functionality here. |
x |
If you want to use read.any functionality on an existing data frame, pass it with this argument. |
isexcel |
If you want to use read.any functionality on an existing data frame, you can tell read.any that this data came from excel using isexcel manually. This comes in handy when excel-integer date conversions are necessary. |
encoding |
Encoding passed to fread and read.csv. |
verbose |
Print helpful information via cat. |
widths |
Column widths. Only use for fixed width files. |
col.names |
Column names. Only use for fixed width files. |
Data frame with the data that was read.
folder = system.file('extdata', package = 'easyr')
read.any('date-time.csv', folder = folder)
# if dates are being converted incorrectly, disable date conversion:
read.any('date-time.csv', folder = folder, auto_convert_dates = FALSE)
# to handle type conversions manually:
read.any('date-time.csv', folder = folder, all_chars = TRUE)
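The field_name_map argument above asks for a named vector mapping the file's column names to the names you want. A hedged sketch; the file name and column names here are hypothetical and only illustrate the format.
## Not run: 
# names are the file's headers, values are the output names.
read.any( 'claims.xlsx', folder = 'data',
  field_name_map = c( 'Claim Number' = 'claim_id', 'Loss Date' = 'loss_date' ),
  require_columns = c( 'claim_id', 'loss_date' ) )
## End(Not run)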
Read File as Text
read.txt(filename, folder = NA)
filename |
File path and name for the file to be read in. |
folder |
Folder path to look for the file in. |
Character variable containing the text in the file.
# write a file.
path = tempfile()
cat( "some text", file = path )
# read the file.
read.txt( path )
# clean up.
file.remove( path )
Behaves like Excel's LEFT, RIGHT, and MID functions. Author: Dave. Tech review: Bryce Chamberlain.
right(string, char)
string |
String to process. |
char |
Number of characters. |
right( "leftmidright",5 )
Run all the R scripts in a folder. Author: Bryce Chamberlain.
runfolder( path, recursive = FALSE, is.local = TRUE, check.fn = NULL, run.files = NULL, verbose = TRUE, edit.on.err = TRUE, pattern = "[.][Rr]$" )
path |
Folder to run. |
recursive |
Run all folder children also. |
is.local |
Code is running on a local machine, not a Shiny server. Helpful for skipping items that can be problematic on the server. In this case, printing to the log. |
check.fn |
Function to run after each file is read in. |
run.files |
Optionally pass the list of files to run. Otherwise, list.files will be run on the folder. |
verbose |
Print names of files and run-time via cat. |
edit.on.err |
Open the running file if an error occurs. |
pattern |
Passed to list.files. Pattern to match/filter files. |
# runfolder( 'R' )
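A self-contained sketch that writes a tiny script to a temporary folder and runs everything in it; the folder and script names are arbitrary.
# write a small script to a temp folder, then run the folder.
scripts = file.path( tempdir(), 'scripts' )
dir.create( scripts, showWarnings = FALSE )
cat( "x <- 1 + 1", file = file.path( scripts, 'calc.R' ) )
runfolder( scripts )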
This gets a bit complex since many errors can occur when reading in excel files. We've done our best to handle common ones. Requires packages: openxlsx, readxl, XML (these are required by easyr). It should NOT be used directly (that's why it isn't exported), but will be called by function [read.any] as necessary, with the applicable defaults set by that function.
rx(filename, sheet, first_column_name, nrows, verbose)
filename |
File path and name for the file to be read in. |
sheet |
The sheet to read in. |
first_column_name |
Pass a column name to help the function find the header row. |
nrows |
Number of rows to read in. |
verbose |
Print helpful messages via cat(). |
Data object
Save Cache
Saves the arguments to a cache file, using the cache.num last checked with cache.ok.
save.cache(...)
... |
Objects to save. |
# check the first cache to see if it exists and dependent files haven't changed.
# if this check is TRUE, code in brackets will get skipped and the cache will be loaded instead.
# set do.load = FALSE if you have multiple files that build a cache,
# to prevent multiple cache loads.
# output will be printed to the console to tell you if the cache was loaded or re-built.
## Not run: 
if( ! cache.ok(1) ){
  # do stuff
  # if this is the final file for this cache,
  # end with save.cache to save passed objects as a cache.
  save.cache(iris)
}
## End(Not run)
Searches all columns for a term and returns all rows with at least one match. Author: Bryce Chamberlain.
sch( x, pattern, ignore.case = FALSE, fixed = FALSE, pluscols = NULL, exact = FALSE, trim = TRUE, spln = NULL )
sch( x, pattern, ignore.case = FALSE, fixed = FALSE, pluscols = NULL, exact = FALSE, trim = TRUE, spln = NULL )
x |
Data to search. |
pattern |
Regex pattern to search for. Most normal search terms will work fine, too. |
ignore.case |
Ignore case in search (uses grepl). |
fixed |
Passed to grepl to match string as-is instead of using regex. See ?grepl. |
pluscols |
Choose columns to return in addition to those where matches are found. Can be a name, number, or 'all' to bring back all columns. |
exact |
Find exact matches instead of pattern matching. |
trim |
Use trimws to trim columns before exact matching. |
spln |
Sample the data using easyr::spl() before searching. This speeds up searching in large datasets when you only need to identify matching columns, not every row that matches. See the n argument in ?spl for more info. |
Matching rows.
sch( iris, 'seto' ) sch( iris, 'seto', pluscols='all' ) sch( iris, 'seto', pluscols='Sepal.Width' ) sch( iris, 'seto', exact = TRUE ) # message no matches and return NULL
sch( iris, 'seto' ) sch( iris, 'seto', pluscols='all' ) sch( iris, 'seto', pluscols='Sepal.Width' ) sch( iris, 'seto', exact = TRUE ) # message no matches and return NULL
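Two further illustrative calls built from the arguments documented above (not from the package docs): a case-insensitive search and a sampled search via spln.
sch( iris, 'SETO', ignore.case = TRUE ) # case-insensitive match
sch( iris, 'seto', spln = 50 )          # sample 50 rows first; useful on large data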
Search for similar strings in a vector.
similar_text( search, context, algo = "jaccard", level = 0.5, return_similarity = FALSE )
similar_text( search, context, algo = "jaccard", level = 0.5, return_similarity = FALSE )
search |
Single character/string to search for. |
context |
Vector of characters to search within. |
algo |
Algorithm to use when determining similarity. Currently, only Jaccard Similarity is implemented. |
level |
Returned characters will be this similar or more similar. Higher values will return fewer/closer matches. |
return_similarity |
Special option for diagnosing. TRUE will ignore [level] and return a named vector where the name is the context value and the value is the similarity. |
Characters that meet the similarity requirement.
similar_text('foobar', c('foo', 'bar', 'foobars')) similar_text('foobar', c('foo', 'bar', 'foobars'), return_similarity = TRUE)
similar_text('foobar', c('foo', 'bar', 'foobars')) similar_text('foobar', c('foo', 'bar', 'foobars'), return_similarity = TRUE)
Extracts a uniform random sample from a dataset or vector. Provides a simpler API than base R. Author: Bryce Chamberlain. Tech Review: Maria Gonzalez.
spl(x, n = 10, warn = TRUE, replace = FALSE, seed = NULL, ...)
spl(x, n = 10, warn = TRUE, replace = FALSE, seed = NULL, ...)
x |
Data to sample from. |
n |
Number or percentage of rows/values to return. If less than 1 it will be interpreted as a percentage. |
warn |
Warn if sampling more than the size of the data. |
replace |
Whether or not to sample with replacement. |
seed |
Set a seed to allow consistent/replicable sampling. |
... |
Other parameters passed to sample() |
Sample dataframe/vector.
spl( c(1:100) ) spl( c(1:100), n = 50 ) spl( iris )
spl( c(1:100) ) spl( c(1:100), n = 50 ) spl( iris )
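Two more illustrative calls based on the arguments above (not from the package docs): a percentage sample and a seeded, reproducible sample.
spl( 1:100, n = 0.1 )        # n < 1 is read as a percentage, so this returns 10 values
spl( iris, n = 5, seed = 1 ) # the same 5 rows every time the seed is set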
Helpful info for U.S. states. Right now, just a mapping of abbreviations to names.
states
states
Data frame.
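A quick way to inspect the object (illustrative only; the column names are not documented here):
head( easyr::states )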
Runs the str function, but only for names matching a character value (regex). Author: Scott Sobel. Tech Review: Bryce Chamberlain.
strx(df, char, ignore.case = T)
strx(df, char, ignore.case = T)
df |
Object with names you'd like to search. |
char |
Regex (character value) to match. |
ignore.case |
(Optional) Ignore case when matching. |
strx(iris,'length')
strx(iris,'length')
Easily summarize all numeric variables. Helpful for flexibly summarizing without knowing the columns. Defaults to sum, but you can also pass a custom function. Typically pass in a data frame after group_by.
sumnum(x, do.fun = NULL, except = c(), do.ungroup = TRUE, ...)
sumnum(x, do.fun = NULL, except = c(), do.ungroup = TRUE, ...)
x |
Grouped tibble to summarize. |
do.fun |
Function to use for the summary. Passed to dplyr::summarize(). Can be a custom function. Defaults to sum(). |
except |
Columns names, numbers, or a logical vector indicating columns NOT to summarize. |
do.ungroup |
Run dplyr::ungroup() after summarizing to prevent future issues with grouping. |
... |
Extra args passed to dplyr::summarize() which are applied as arguments to the function passed in do.fun. |
Summarized data frame or tibble.
require(dplyr) require(easyr) sumnum( group_by( cars, speed ) ) sumnum( group_by( cars, speed ), mean ) sumnum( cars )
require(dplyr) require(easyr) sumnum( group_by( cars, speed ) ) sumnum( group_by( cars, speed ), mean ) sumnum( cars )
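Two further illustrative calls using the arguments documented above (not from the package docs): excluding a column and passing an extra argument through to the summary function.
require(dplyr)
sumnum( group_by( iris, Species ), except = 'Sepal.Width' ) # leave one numeric column out
sumnum( group_by( iris, Species ), mean, na.rm = TRUE )     # na.rm is passed to mean() via ...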
Easy Try/Catch implementation to return the same message on error or warning. Makes it easier to write tryCatches. Author: Bryce Chamberlain. Tech review: Lindsay Smelzter.
tcmsg(code_block, ...)
tcmsg(code_block, ...)
code_block |
Code to run in Try Catch. |
... |
Strings to concatenate to form the message that is returned. |
tryCatch({ tcmsg({ NULL = 1 }, 'Cannot assign to NULL','variable' ) }, error = function(e) print( e ) ) tryCatch({ tcmsg({ as.numeric('abc') },'Issue in as.numeric()') }, warning = function(e) print( e ) )
tryCatch({ tcmsg({ NULL = 1 }, 'Cannot assign to NULL','variable' ) }, error = function(e) print( e ) ) tryCatch({ tcmsg({ as.numeric('abc') },'Issue in as.numeric()') }, warning = function(e) print( e ) )
Transpose operation that sets column names equal to a column in the original data. Author: Bryce Chamberlain.
tcol(x, header, cols.colname = "col", do.atype = TRUE)
tcol(x, header, cols.colname = "col", do.atype = TRUE)
x |
Data frame to be transposed. |
header |
Column name/number to be used as column names of transposed data. |
cols.colname |
Name to use for the column of column names in the transposed data. |
do.atype |
Transposing converts values to strings, since data types are uncertain. Run atype to automatically correct variable typing where possible. This will slow the result a bit. |
Transposed data frame.
# create a summary dataset from iris. x = dplyr::summarize_at( dplyr::group_by( iris, Species ), dplyr::vars( Sepal.Length, Sepal.Width ), list(sum) ) # run tcol tcol( x, 'Species' )
# create a summary dataset from iris. x = dplyr::summarize_at( dplyr::group_by( iris, Species ), dplyr::vars( Sepal.Length, Sepal.Width ), list(sum) ) # run tcol tcol( x, 'Species' )
Easy Try/Catch implementation to return the same message as a warning on error or warning. Makes it easier to write tryCatches. Author: Bryce Chamberlain. Tech review: Lindsay Smelzter.
tcwarn(code_block, ...)
tcwarn(code_block, ...)
code_block |
Code to run in Try Catch. |
... |
Strings to concatenate to form the message that is returned. |
tryCatch({ tcwarn({ NULL = 1 },'Cannot assign to NULL','variable') }, warning = function(e) print( e ) ) tryCatch({ tcwarn({ as.numeric('abc') },'Issue in as.numeric()') }, warning = function(e) print( e ) )
tryCatch({ tcwarn({ NULL = 1 },'Cannot assign to NULL','variable') }, warning = function(e) print( e ) ) tryCatch({ tcwarn({ as.numeric('abc') },'Issue in as.numeric()') }, warning = function(e) print( e ) )
Flexible boolean conversion. Author: Bryce Chamberlain.
tobool( x, preprocessed.values = NULL, nastrings = easyr::nastrings, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, true.vals = c("true", "1", "t", "yes"), false.vals = c("false", "0", "f", "no") )
tobool( x, preprocessed.values = NULL, nastrings = easyr::nastrings, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, true.vals = c("true", "1", "t", "yes"), false.vals = c("false", "0", "f", "no") )
x |
Value or vector to be converted. |
preprocessed.values |
Strings need to have NAs set, lowercase and be trimmed before they can be checked. To avoid doing this multiple times, you can pass these processed values to the function. |
nastrings |
Vector of characters to be considered NAs. tobool will treat these like NAs. Defaults to the easyr::nastrings list. |
ifna |
Action to take if NAs are created. 'return-unchanged' returns the sent vector unchanged; 'warning' results in a warning and returns the converted vector with new NAs; 'error' results in an error. |
verbose |
Choose to view messaging. |
true.vals |
Values to consider as TRUE. |
false.vals |
Values to consider as FALSE. |
Converted logical vector.
tobool( c( 'true', 'FALSE', 0, 1, NA, 'yes', 'NO' ) )
tobool( c( 'true', 'FALSE', 0, 1, NA, 'yes', 'NO' ) )
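An illustrative call (not from the package docs) that extends the accepted values to hypothetical single-letter flags:
tobool( c( 'y', 'n', NA ), true.vals = c( 'true', '1', 't', 'yes', 'y' ), false.vals = c( 'false', '0', 'f', 'no', 'n' ) )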
Shorthand for as.character
tochar(x)
tochar(x)
x |
Value to check. |
as.character result
tochar(NA) tochar(1)
tochar(NA) tochar(1)
Flexible date conversion function using lubridate. Works with dates in many formats, without needing to know the format in advance. Only use this if you don't know the format of the dates before hand. Otherwise, lubridate functions parse_date_time, mdy, etc. should be used. Author: Bryce Chamberlain. Tech review: Dominic Dillingham.
todate( x, nastrings = easyr::nastrings, aggressive.extraction = TRUE, preprocessed.values = NULL, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, allow_times = FALSE, do.month.char = TRUE, do.excel = TRUE, min.acceptable = lubridate::ymd("1920-01-01"), max.acceptable = lubridate::ymd("2050-01-01") )
todate( x, nastrings = easyr::nastrings, aggressive.extraction = TRUE, preprocessed.values = NULL, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, allow_times = FALSE, do.month.char = TRUE, do.excel = TRUE, min.acceptable = lubridate::ymd("1920-01-01"), max.acceptable = lubridate::ymd("2050-01-01") )
x |
Value or vector to be converted. |
nastrings |
Vector of characters to be considered NAs. todate will treat these like NAs. Defaults to the easyr::nastrings list. |
aggressive.extraction |
todate will take dates inside long strings (like filenames) and convert them to dates. This seems to be the preferred outcome, so we leave it as default (TRUE). However, if you want to avoid this you can do so via this option (FALSE). |
preprocessed.values |
Strings need to have NAs set, lowercase and be trimmed before they can be checked. To avoid doing this multiple times, you can pass these processed values to the function. |
ifna |
Action to take if NAs are created. 'return-unchanged' returns the sent vector unchanged; 'warning' results in a warning and returns the converted vector with new NAs; 'error' results in an error; 'return-na' returns new NAs without a warning. |
verbose |
Choose to view messaging. |
allow_times |
Set to TRUE to allow DateTimes as output, otherwise this will always convert to Dates (losing time information). This is better for binding data, hence the default FALSE. |
do.month.char |
Attempt to convert month names in text. lubridate does this by default, but sometimes it can result in inaccurate dates. For example, "Feb 2017" is converted to 2-20-2017 even though no day was given. |
do.excel |
Check for excel-formatted numbers. |
min.acceptable |
Set NA if converted value is less than this value. Helps to prevent numbers from being assumed as dates. Set NULL to skip this check. Does not affect character conversions. |
max.acceptable |
Set NA if converted value is greater than this value. Helps to prevent numbers from being assumed as dates. Set NULL to skip this check. Does not affect character conversions. |
Converted vector using lubridate::parse_date_time(x,c('mdy','ymd','dmy'))
x <- c( '20171124', '2017/12/24', NA, '12/24/2017', '5/11/2017 1:51PM' ) x2 <- todate(x) x2
x <- c( '20171124', '2017/12/24', NA, '12/24/2017', '5/11/2017 1:51PM' ) x2 <- todate(x) x2
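Two further illustrative calls based on the arguments above (not from the package docs): keeping the time component and widening the acceptable date window.
todate( '5/11/2017 1:51PM', allow_times = TRUE )                    # keep the time instead of flattening to a date
todate( '1/1/1900', min.acceptable = lubridate::ymd('1850-01-01') ) # accept dates earlier than the default 1920 cutoff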
Flexible number conversion for converting strings to numbers. Handles $ , ' and spaces. Author: Bryce Chamberlain. Tech review: Dominic Dillingham.
tonum( x, preprocessed.values = NULL, nastrings = easyr::nastrings, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, nazero = FALSE, checkdate = TRUE, remove.chars = FALSE, do.logical = TRUE, do.try.integer = TRUE, multipliers = c(`%` = 1/100, K = 1000, M = 1000^2, B = 1000^3) )
tonum( x, preprocessed.values = NULL, nastrings = easyr::nastrings, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, nazero = FALSE, checkdate = TRUE, remove.chars = FALSE, do.logical = TRUE, do.try.integer = TRUE, multipliers = c(`%` = 1/100, K = 1000, M = 1000^2, B = 1000^3) )
x |
Vector to convert. |
preprocessed.values |
Strings need to have NAs set, lowercase and be trimmed before they can be checked. To avoid doing this multiple times, you can pass these processed values to the function. |
nastrings |
Vector of characters to be considered NAs. tonum will treat these like NAs. Defaults to the easyr::nastrings list. |
ifna |
Action to take if NAs are created. 'return-unchanged' returns the sent vector unchanged; 'warning' results in a warning and returns the converted vector with new NAs; 'error' results in an error; return-na returns data with new NAs and prints via cat if verbose. |
verbose |
Choose to view messaging. |
nazero |
(Optional) Convert NAs to 0. Defaults to FALSE, in which case NAs stay NA. |
checkdate |
Check if the column is a date first. If this has already been done, set this to FALSE so it doesn't run again. |
remove.chars |
Remove characters for aggressive conversion to numbers. |
do.logical |
Check for logical-form vectors. |
do.try.integer |
Return an integer if possible. Integers are a more compact data type and should be used whenever possible. |
multipliers |
Named vector of factor symbols and values to check. Setting to NULL may speed up operations. |
Converted vector.
tonum( c('123','$50.02','30%','(300.01)',NA,'-','') ) tonum( c('123','$50.02','30%','(300.01)',NA,'-',''), nazero = FALSE ) tonum( c( '$(3,891)M', '4B', '3.41K', '30', '40K' ) )
tonum( c('123','$50.02','30%','(300.01)',NA,'-','') ) tonum( c('123','$50.02','30%','(300.01)',NA,'-',''), nazero = FALSE ) tonum( c( '$(3,891)M', '4B', '3.41K', '30', '40K' ) )
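An illustrative call (not from the package docs) showing the aggressive remove.chars option; the input string is hypothetical and, assuming the non-numeric text is stripped, it should yield 1200.
tonum( '1,200 units', remove.chars = TRUE ) # strip the text, keep the number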
Installs a package if it needs to be installed, and calls require to load the package. Author: Scott Sobel. Tech Review: Bryce Chamberlain.
usepkg(packages, noCache = FALSE, repos = "http://cran.us.r-project.org")
usepkg(packages, noCache = FALSE, repos = "http://cran.us.r-project.org")
packages |
Character or character vector with names of the packages you want to use. |
noCache |
When checking packages, you can choose to ignore the cached list, which will increase accuracy but decrease speed. |
repos |
Choose the URL to install from. |
# packages shouldn't be installed during tests or examples according to CRAN. # therefore, examples cannot be provided because CRAN now runs donttest examples. usepkg('geodist', FALSE, 'http://cran.us.r-project.org')
# packages shouldn't be installed during tests or examples according to CRAN. # therefore, examples cannot be provided because CRAN now runs donttest examples. usepkg('geodist', FALSE, 'http://cran.us.r-project.org')
Check various properties of 2 data frames to ensure they are equivalent.
validate.equal( df1, df2, id.column = NULL, regex.remove = "[^A-z0-9.+\\/,-]", do.set.NA = TRUE, nastrings = easyr::nastrings, match.round.to.digits = 4, do.all.columns.before.err = FALSE, check.column.order = FALSE, sort.by.id = TRUE, acceptable.pct.rows.diff = 0, acceptable.pct.vals.diff = 0, return.summary = FALSE, verbose = TRUE )
validate.equal( df1, df2, id.column = NULL, regex.remove = "[^A-z0-9.+\\/,-]", do.set.NA = TRUE, nastrings = easyr::nastrings, match.round.to.digits = 4, do.all.columns.before.err = FALSE, check.column.order = FALSE, sort.by.id = TRUE, acceptable.pct.rows.diff = 0, acceptable.pct.vals.diff = 0, return.summary = FALSE, verbose = TRUE )
df1 |
First data frame to compare. |
df2 |
Second data frame to compare. |
id.column |
If available, a column to use as an ID. Helpful in various checks and output. |
regex.remove |
Pattern to remove from strings. Used in gsub to remove characters we don't want to consider when comparing values. Set to NULL, NA, or "" to leave strings unchanged. |
do.set.NA |
Remove NA strings. |
nastrings |
Strings to consider NA. |
match.round.to.digits |
Round numbers to these digits before checking equality. |
do.all.columns.before.err |
Check all columns before returning an error. Takes longer but returns more detail. If FALSE, stops at first column that doesn't match and returns mismatches. |
check.column.order |
Enforce same column order. |
sort.by.id |
Sort by the id column before making comparisons. |
acceptable.pct.rows.diff |
If you are OK with differences in a few rows, set this value. If the percentage of mismatched rows in a column is below it, the function will consider the columns equivalent. Interpreted as a percentage (it gets divided by 100). |
acceptable.pct.vals.diff |
If you are OK with small differences in values, set this value. If the absolute percentage difference between numeric values is below it, the function will consider the values equivalent. Interpreted as a percentage (it gets divided by 100). |
return.summary |
Return 2 items in a list, the row mismatches and a summary of row mismatches. |
verbose |
Print helpful information via cat(). |
May return information about mismatches. Otherwise doesn't return anything (NULL).
validate.equal( iris, iris )
validate.equal( iris, iris )
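An illustrative comparison (not from the package docs): introduce a single mismatched value and allow up to 5 percent of rows per column to differ. Exact output depends on the function's defaults.
iris2 <- iris
iris2$Sepal.Length[1] <- 99 # one mismatched value out of 150 rows
validate.equal( iris, iris2, acceptable.pct.rows.diff = 5 )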
Improved write function. Writes to csv without row names and automatically adds .csv to the file name if it isn't there already. Changes to .csv if another extension is passed. Easier to type than write.csv(row.names=F). Author: Bryce Chamberlain. Tech review: Maria Gonzalez.
w(x, filename = "out", row.names = FALSE, na = "")
w(x, filename = "out", row.names = FALSE, na = "")
x |
Data frame to write to file. |
filename |
(Optional) Filename to use. |
row.names |
(Optional) Specify if you want to include row names/numbers in the output file. |
na |
(Optional) String to print for NAs. Defaults to an empty/blank string. |
# write the cars dataset. path = paste0( tempdir(), '/out.csv' ) w( cars, path ) # cleanup. file.remove( path )
# write the cars dataset. path = paste0( tempdir(), '/out.csv' ) w( cars, path ) # cleanup. file.remove( path )
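An illustrative call (not from the package docs) showing the extension handling described above: a .txt path is written as .csv.
path = file.path( tempdir(), 'out.txt' )
w( cars, path )
# the file is written with a .csv extension instead; cleanup.
file.remove( file.path( tempdir(), 'out.csv' ) )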
Converts dates formatted as long integers from Excel to Date format in R, accounting for known Excel leap year errors. Author: Bryce Chamberlain. Tech review: Dominic Dillingham.
xldate( x, origin = "1899-12-30", nastrings = easyr::nastrings, preprocessed.values = NULL, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, allow_times = FALSE, do.month.char = TRUE, min.acceptable = lubridate::ymd("1920-01-01"), max.acceptable = lubridate::ymd("2050-01-01") )
xldate( x, origin = "1899-12-30", nastrings = easyr::nastrings, preprocessed.values = NULL, ifna = c("return-unchanged", "error", "warning", "return-na"), verbose = TRUE, allow_times = FALSE, do.month.char = TRUE, min.acceptable = lubridate::ymd("1920-01-01"), max.acceptable = lubridate::ymd("2050-01-01") )
x |
Vector of values. |
origin |
Zero value to use in date conversion. Older version of excel might use a different value. |
nastrings |
Vector of characters to be considered NAs. xldate will treat these like NAs. Defaults to the easyr::nastrings list. |
preprocessed.values |
Strings need to have NAs set, lowercase and be trimmed before they can be checked. To avoid doing this twice, you can tell the function that it has already been done. |
ifna |
Action to take if NAs are created. 'return-unchanged' returns the sent vector unchanged; 'warning' results in a warning and returns the converted vector with new NAs; 'error' results in an error. |
verbose |
Choose to view messaging. |
allow_times |
Return values with time, not just the date. |
do.month.char |
Convert month character names like Feb, March, etc. |
min.acceptable |
Set NA if converted value is less than this value. Helps to prevent numbers from being assumed as dates. Set NULL to skip this check. |
max.acceptable |
Set NA if converted value is greater than this value. Helps to prevent numbers from being assumed as dates. Set NULL to skip this check. |
Vector of converted values.
xldate( c('7597', '42769', '47545', NA ) )
xldate( c('7597', '42769', '47545', NA ) )
Date Difference in Years
ydiff(x, y, do.date.convert = TRUE, do.numeric = TRUE)
ydiff(x, y, do.date.convert = TRUE, do.numeric = TRUE)
x |
Vector of starting dates or items that can be converted to dates by todate. |
y |
Vector of ending dates or items that can be converted to dates by todate. |
do.date.convert |
Convert to dates before running the difference. If you know your columns are already dates, setting to FALSE will make your code run faster. |
do.numeric |
Convert the output to a number instead of a date difference object. |
Vector of differences.
ydiff( lubridate::mdy( '1/1/2018' ), lubridate::mdy( '3/4/2018' ) )
ydiff( lubridate::mdy( '1/1/2018' ), lubridate::mdy( '3/4/2018' ) )
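An illustrative call (not from the package docs): character inputs are converted by todate() first when do.date.convert = TRUE, so date strings work directly.
ydiff( '1/1/2018', '3/4/2018' )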