Unwanted items on a plot
I have a problem while using a subset of a data.frame
in R.
The subset gets created and displayed correctly, but when I try to plot it using qplot()
, the rows which were not selected by the subset()
also get shown along one axis.
The actual file I am reading in is a web server log, but I h开发者_JS百科ave created a small example to illustrate my problem:
This is the ITEMSSOLD.CSV
file I read in:
CUST,DT,ITEM,PRICE
BigJoe,10/13/2010,Pickup Truck,20000
TightWad,10/13/2010,USB Drive,12
Jane,10/13/2010,Smart Car,30000
Scrooge,10/13/2010,Gumdrops,1
GeekyMan,10/13/2010,Smart Car,30000
I read this into a data frame as follows:
sales_df <- read.table("C:/R_Expt/ItemsSold.csv", header=TRUE, sep=",")
I then did a subset to get the high-ticket items as follows:
big_sales_df <- subset(sales_df, PRICE>100)
big_sales_df
big_sales_df
CUST DT ITEM PRICE
1 BigJoe 10/13/2010 Pickup Truck 20000
3 Jane 10/13/2010 Smart Car 30000
5 GeekyMan 10/13/2010 Smart Car 30000
So that looks OK.
When I try to plot it via qplot
as follows:
qplot(nrow, ITEM, data = ddply(big_sales_df, .(ITEM), "nrow"))
the resulting plot shows all the ITEMS on the Y axis, instead of only Pickup Truck and Smart Car.
The ddply()
alone produces the following output:
ddply(big_sales_df, .(ITEM), "nrow")
ITEM nrow
1 Pickup Truck 1
2 Smart Car 2
Since the example has a small number of ITEMs, the plot is still readable, but in real life, I am trying to plot the names of slow web pages and unfortunately, qplot()
tries to put the names of all webpages along the Y axis and it becomes a black blur.
I tried sqldf()
also:
qplot(NSOLD, ITEM, data = sqldf('select ITEM, count(*) as NSOLD from big_sales_df group by ITEM order by count(*) desc'))
but this produces the same plot.
What I understood is that subset()
is somehow carrying the full parent information inside, instead of just the matched rows.
Is there any way of telling subset()
that it should only keep the relevant information?
Or any other way of getting around subset()
carrying along the empty members?
I know that a brute force method might be to write the result of the subset()
into another CSV file and then read it back in a data.frame
, but I am sure there is an easier way.
Many thanks to all your R gurus out there!
Atish
You need to remove the factor levels that were dropped from your subset.
big_sales_df$ITEM <- factor(big_sales_df$ITEM)
big_sales_df$CUST <- factor(big_sales_df$CUST)
OR change how you read in the data:
sales_df <- read.csv("ItemsSold.csv", header=TRUE, stringsAsFactors=FALSE)
When you subset your data the factors that existed in the original data set persist. Take the diamonds data set for example. You have 5 different cuts.
unique(diamonds$cut) ## Ideal, Premium, Good, Very Good, Fair
If we subset diamonds, we get:
str(subset(diamonds, cut == "Ideal")) ## Look at structure
In str(), we see that cut maintains the factors that it had originally.
$ cut : Factor w/ 5 levels "Fair","Good",..: 5 5 5 5 5 5 5 5 5 5 ...
Even though we've removed all the other categories of cut, the factoring persists.
You can remove the extra factors by refactoring the column with it's own unique subsetted factors.
x$cut <- factor(x$cut, labels=unique(x$cut))
Now looking more specifically at your example:
test <- ddply(big_sales_df, .(ITEM), "nrow")
test$ITEM <- factor(test$ITEM, labels=unique(test$ITEM))
Now, try your plot again.
Or you can cheat by factoring item:
qplot(nrow, factor(ITEM), data = ddply(big_sales_df, .(ITEM), "nrow")
精彩评论