Monday Feb 10, 2014

Finding datasets for South African Heart Disease

I've been working with R recently and have been looking for interesting data sets, for example:

http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html

The following command is documented, but it fails because R does not follow the redirect chain (two 302 responses and a "301 Moved Permanently"):

> read.table("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data", sep=",",head=T,row.names=1)
Error in file(file, "rt") : cannot open the connection


This worked:

> read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data", sep=",",head=T,row.names=1)
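Since hosts move, it can help to try several candidate locations in order. Here is a minimal sketch, assuming a hypothetical helper read_first_available (not part of R or of the commands above). The demonstration uses local temporary files with made-up rows rather than the Stanford URLs, so it runs offline:

```r
# Hypothetical helper: try each candidate location in turn and return the
# first one that read.table() can open and parse.
read_first_available <- function(locations, ...) {
  for (loc in locations) {
    result <- tryCatch(
      suppressWarnings(read.table(loc, ...)),
      error = function(e) NULL   # treat any failure as "try the next one"
    )
    if (!is.null(result)) return(result)
  }
  stop("none of the locations could be read")
}

# Offline demonstration: the first path does not exist, the second does.
good <- tempfile(fileext = ".data")
writeLines(c("id,x,y", "a,1,2", "b,3,4"), good)   # made-up toy rows
missing_file <- file.path(tempdir(), "does-not-exist.data")

demo <- read_first_available(c(missing_file, good),
                             sep = ",", header = TRUE, row.names = 1)
print(demo)
```

With real URLs the same call would take the old and new host names as the locations vector.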

Even better:

http://cran.r-project.org/web/packages/ElemStatLearn/ElemStatLearn.pdf

> install.packages("ElemStatLearn")
> help(package="ElemStatLearn")
> library( package = ElemStatLearn )
> data(SAheart)
> str(SAheart)
> summary(SAheart)

R attributes and regexpr

I've been working with R recently.

Here is an example of using the match.length attribute that is returned from regexpr:

Three strings in a vector; the first and third each contain an embedded quoted string:

data=c("<a href=\"ch4.html\">Chapter 1</a>",
       "no quoted string is embedded in this string",
       "<a   href=\"appendix.html\">Appendix</a>")

Use regexpr to locate the embedded strings:

> locations <- regexpr("\"(.*?)\"", data)

The first string matches at position 9 with length 10 and the third at position 11 with length 15; -1 marks the string with no match:

> locations
[1]  9 -1 11
attr(,"match.length")
[1] 10 -1 15
attr(,"useBytes")
[1] TRUE

Pull the match.length attribute out as a plain vector:

> attr(locations,"match.length")
[1] 10 -1 15
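To see what an attribute is without involving regexpr, here is a small sketch that attaches one by hand. The vector x and its values simply mirror the output above; as.vector is used to show that attributes are metadata that can be dropped without touching the values:

```r
# Build an integer vector and attach a "match.length" attribute manually.
x <- c(9L, -1L, 11L)
attr(x, "match.length") <- c(10L, -1L, 15L)

# attr() reads a single attribute back.
lens <- attr(x, "match.length")

# as.vector() returns the same values with all attributes removed.
plain <- as.vector(x)

print(lens)
print(attributes(plain))
```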

Use substr and the attribute vector to extract the strings:

> quoted_strings=substr( data, 
                         locations, 
                         locations+attr(locations,"match.length")-1 )    
> quoted_strings
[1] "\"ch4.html\""      ""                  "\"appendix.html\""

Maybe you'd like to remove the embedded quote characters from your strings:

> gsub("\"", "", quoted_strings)
[1] "ch4.html"      ""              "appendix.html"
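The steps so far can be gathered into one function. This is only a sketch: first_quoted is a hypothetical name, and it assumes you want the first double-quoted substring of each element, with an empty string where there is none:

```r
data <- c("<a href=\"ch4.html\">Chapter 1</a>",
          "no quoted string is embedded in this string",
          "<a   href=\"appendix.html\">Appendix</a>")

# Hypothetical helper: first double-quoted substring of each element,
# with the surrounding quote characters removed.
first_quoted <- function(x) {
  loc <- regexpr("\"(.*?)\"", x)            # start positions, -1 if no match
  len <- attr(loc, "match.length")          # match lengths, -1 if no match
  found <- substr(x, loc, loc + len - 1)    # "" where nothing matched
  gsub("\"", "", found)                     # strip the quote characters
}

result <- first_quoted(data)
print(result)
```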

An alternative is to use regmatches, which returns only the matches and drops the elements that did not match:

> regmatches(data,locations)
[1] "\"ch4.html\""      "\"appendix.html\""
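
Note that the regmatches result has two elements while the input has three, so it no longer lines up with data. If you need it to, one option (a sketch, with aligned as an illustrative variable name) is to fill a vector of NA values at the positions that matched:

```r
data <- c("<a href=\"ch4.html\">Chapter 1</a>",
          "no quoted string is embedded in this string",
          "<a   href=\"appendix.html\">Appendix</a>")
locations <- regexpr("\"(.*?)\"", data)

# Start with NA for every input string, then fill in the matches.
aligned <- rep(NA_character_, length(data))
aligned[locations != -1] <- regmatches(data, locations)

print(aligned)
```

Here NA marks "no match", which is easier to tell apart from a genuinely empty quoted string than "" is.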