In layman's terms, R is a big, fat calculator. But R is not a jack of all trades when it comes to handling voluminous data such as Big Data. Have you ever wondered how much is 'Big' in Big Data?
When organizations have a plethora of data (high volume, often due to a variety of secondary data sources) and business managers are strangled by it, analysis becomes challenging because it requires a lot of data engineering, in other words, data preprocessing. So what gets more difficult when data is big? Here's what can happen:
- The data may not load / completely load into memory
- Analyzing the data may take a long time
- Visualizations get messy, resulting in poor interpretability of the models / algorithms
So, as a best practice, it is good to know how much data your system can load and handle, and R itself gives us ways to check.
R sets a limit on the maximum amount of memory it will allocate from the OS. On Windows, the function memory.limit() pulls up the allowable memory limit for data processing. R loads all data into memory by default, whereas SAS allocates memory dynamically and keeps data on disk by default. The upshot of this rat race is that SAS handles very large datasets better out of the box. To work around this shortfall, we can change the limit in R.
We can use memory.limit(size = ...) to change R's allocation limit, while memory.size() reports how much memory R is currently using; both depend on the local machine configuration. The limit is usually 2 or 3 GB if R runs on a 32-bit OS, and even then you shouldn't load huge datasets into memory, or you will end up thrashing virtual memory and swapping. Also, 2 GB of memory used by R is not the same as 2 GB on disk, owing to the overhead R needs to keep track of the data and the memory consumed by the analysis itself (often the major chunk).
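For instance, on a Windows machine (memory.size() and memory.limit() are Windows-only and have been retired in recent R releases, so treat this as a sketch for older setups; the 16000 MB figure is purely illustrative):
memory.size()              # memory currently used by R, in MB
memory.limit()             # current allocation limit, in MB
memory.limit(size = 16000) # raise the ceiling (illustrative value, in MB)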
So, what can be done?
Suppose you have too much data; there are a few things that you can do:
- Make the data smaller
- Get a bigger computer
- Access the data differently
- Split up the dataset for analysis
Option 1: Making the data smaller
Often the best initial option is to make sure you really need to deal with the whole problem. Run the analysis on a slice of your data; you may get everything you need in order to move forward. The challenge is: can you slice 500 random rows from the dataset and compute an odds ratio on that sample (with a Generalized Linear Model or Logistic Regression, for instance)?
Code:
library(epitools)  # assumption: oddsratio() here comes from the epitools package
rows_to_select <- sample(1:nrow(dataset), 500, replace = FALSE)  # pick 500 random row indices
dataset_sample <- dataset[rows_to_select, ]                      # slice those rows
oddsratio(dataset_sample$has_plan, as.factor(dataset_sample$x.variable1))
So, start with smaller data. If your data comes from a database, you can issue a SQL query directly from R to pull just the subset you want. The 'RMySQL' package does this job if that scenario applies to you.
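Here is a minimal sketch with RMySQL; the connection details, table name, and column names are placeholders for whatever your database actually holds:
library(RMySQL)  # DBI-compatible MySQL driver
con <- dbConnect(MySQL(), user = "analyst", password = "secret",
                 dbname = "sales_db", host = "localhost")
# Let the database do the slicing, so only 500 rows ever reach R
dataset_sample <- dbGetQuery(con,
  "SELECT has_plan, x_variable1 FROM customers ORDER BY RAND() LIMIT 500")
dbDisconnect(con)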
Option 2: Data Table rather than a Data Frame
The data.table package allows some optimizations over a data frame, at the cost of a slightly different syntax.
library(data.table)                 # assumes the package is installed
dataset_new <- data.table(dataset)  # convert the data frame to a data.table
object.size(dataset_new)            # memory footprint of the data.table
object.size(dataset)                # compare with the original data frame
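To give a flavour of that syntax, here is a small sketch of a grouped summary (has_plan and spend are placeholder column names):
# dt[i, j, by] replaces the usual aggregate()/tapply() idiom on data frames
dataset_new[, .(n = .N, avg_spend = mean(spend)), by = has_plan]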
Option 3: Get a bigger computer
You may or may not be lucky enough to have access to a bigger machine. Get some temporary space, use a node on a high-performance cluster, or rent some cloud computing time. This solves the problem in a jiffy, but it calls for a larger investment, and many people do not opt for it owing to cost.
Option 4: Split it Up
For example, with a big database, ask for 200 MB of records at a time, analyze each chunk, and then combine the results. Mind you, this is essentially MapReduce, or Split-Apply-Combine, and we can use computing clusters to parallelize the analysis, as in the sketch below.
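Here is a minimal sketch of that idea on a large CSV, assuming a file called bigdata.csv with a numeric spend column (both placeholders) and reading 100,000 rows per chunk:
con <- file("bigdata.csv", open = "r")                # placeholder file name
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # column names, read once
total_sum <- 0; total_n <- 0
repeat {
  # "Split": pull the next 100,000 rows from the open connection
  chunk <- tryCatch(
    read.csv(con, header = FALSE, col.names = header, nrows = 100000),
    error = function(e) NULL)                         # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  # "Apply": accumulate whatever per-chunk summaries you need
  total_sum <- total_sum + sum(chunk$spend)
  total_n   <- total_n + nrow(chunk)
}
close(con)
total_sum / total_n                                   # "Combine": overall mean of spend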
Sometimes the analysis is simply too slow, either because memory is scarce on the local machine or because the data is big and you are doing a lot of work over and over. Depending on the analysis, you may be able to create a new dataset with just the subset you need, then remove the larger dataset from your memory space:
rows <- 1:500
columns <- 1:30
bigdata_subset <- bigdata[rows, columns]  # keep only the rows and columns you need
rm(bigdata)                               # drop the full dataset from the workspace
gc()                                      # trigger garbage collection to actually free the memory
Profiling can also be a saviour here, since it can surface performance issues you might not anticipate, such as:
- Some modeling code defaults to bootstrapping confidence intervals that you don’t care about with 1000 iterations per model
- You accidentally wrote the code so that it runs some slow operation on every column of your 5000-column dataset and then selects the column you want, rather than running the operation only on the one you care about
- You do something for every line of your huge data frame and then combine the results using c() or rbind(), rather than assigning into a pre-allocated vector or matrix (see the sketch after this list)
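To illustrate that last point, here is a small sketch with a hypothetical per-row computation on an all-numeric data frame called huge_df; growing the result with c() copies the vector on every iteration, while pre-allocating reserves the memory once:
n <- nrow(huge_df)
# Slow pattern: result is copied and re-grown on every iteration
result <- c()
for (i in 1:n) {
  result <- c(result, sum(huge_df[i, ]))   # sum() stands in for any per-row computation
}
# Faster pattern: pre-allocate once, then fill in place
result <- numeric(n)
for (i in 1:n) {
  result[i] <- sum(huge_df[i, ])
}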
We will discuss profiling in detail in an upcoming blog post.