diff --git a/CommonMaterial/FrontMatter.Rnw b/CommonMaterial/FrontMatter.Rnw new file mode 100644 index 0000000..9f22753 --- /dev/null +++ b/CommonMaterial/FrontMatter.Rnw @@ -0,0 +1,265 @@ +<<>>= +opts_chunk$set( fig.path="figures/FrontMatter-" ) +set_parent('Master-Starting.Rnw') +set.seed(123) +@ + + +\chapter*{About These Notes} + + +We present an approach to teaching introductory and intermediate +statistics courses that is tightly coupled with computing generally and with \R\ and \RStudio\ in particular. These activities and examples are intended to highlight a modern approach to statistical education that focuses on modeling, resampling-based inference, and multivariate graphical techniques. A secondary goal is to +facilitate computing with data through the use of small simulation studies %data scraping from the internet +and an appropriate statistical analysis workflow. This follows the +philosophy outlined by Nolan and Temple Lang\cite{nola:temp:2010}. The importance of modern computation\marginnote{$\ $} in statistics education is a principal component of the American Statistical Association's recently adopted curriculum guidelines\cite{ASAcurriculum2014}. + +Throughout this book (and its companion volumes), we +introduce multiple activities, some +appropriate for an introductory course, others suitable for higher levels, that +demonstrate key concepts in statistics and modeling +while also supporting the core material of more traditional courses. + +\subsection*{A Work in Progress} + +\Caution{Despite our best efforts, you WILL find bugs both in this document and in our code. +Please let us know when you encounter them so we can call in the exterminators.}% + +These materials were developed for a workshop entitled +\emph{Teaching Statistics Using R} prior to the 2011 United States Conference +on Teaching Statistics and revised for USCOTS 2011, USCOTS 2013, eCOTS 2014, ICOTS 9, and USCOTS 2015.
+We organized these workshops to help instructors integrate \R\ (as well as some related technologies) into statistics courses at all levels. +We received great feedback and many wonderful ideas from the participants and from those with whom we've shared these materials since the workshops. + +Consider these notes to be a work in progress. +%\SuggestionBox{Sometimes we will mark +%places where we would especially like feedback with one of these suggestion boxes. +%But we won't do that everywhere we want feedback or there won't be room for +%anything else.}% +We appreciate any feedback you are willing to share as we continue +to work on these materials and the accompanying \pkg{mosaic} package. +Drop us an email at \url{pis@mosaic-web.org} with any comments, suggestions, +corrections, etc. + +Updated versions will be posted at \url{http://mosaic-web.org}. + + +\subsection*{Two Audiences} + +We initially developed these materials for +instructors of statistics at the college or +university level. Another audience is the students these instructors teach. +Some of the sections, examples, and exercises are written with one or the other of +these audiences more clearly at the forefront. This means that +\begin{enumerate} +\item Some of the materials can be used essentially as is with students. +\item Some of the materials aim to equip instructors to develop their own +expertise in \R\ and \RStudio\ so that they can create their own teaching materials. +\end{enumerate} + +Although the distinction can get blurry, and what works ``as is'' in one setting may +not work ``as is'' in another, we'll try to indicate which parts +fit into each category as we go along. + +\subsection*{R, RStudio, and R Packages} + +\R\ can be obtained from \url{http://cran.r-project.org/}. +Download and installation are quite straightforward for Mac, PC, or Linux machines. + +\RStudio\ is an integrated development environment (IDE) that facilitates use of \R\ for both novice and expert users.
We have adopted it as our standard teaching environment because it dramatically simplifies the use of \R\ for instructors and for students.% +\Pointer[-3cm]{There are several things we use that can be done only in \RStudio, for instance \function{manipulate} and \RStudio's integrated support for reproducible research.}% +%\RStudio\ is available from \url{http://www.rstudio.org/}. +\RStudio\ can be installed as a desktop (laptop) application or as a server application that is accessible to users via the Internet.\FoodForThought[-.5cm]{The \RStudio\ server version works well for beginning students. All they need is a web browser, which avoids potential problems caused by the oddities of students' individual computers.} + +In addition to \R\ and \RStudio, we will make use of several packages that need to be installed and loaded separately. The \pkg{mosaic} package (and its dependencies) will be used throughout. Other packages appear from time to time as well. + + +%\subsection*{Notation} +% +%%\newthought{Exercises} +%Exercises marked with 1 star are intended for students in courses beyond the +%introductory level. Exercises marked with 2 stars are intended primarily for +%instructors (but may also be appropriate for students in higher level courses). + +\subsection*{Marginal Notes} +Marginal notes appear here and there. +%\DiggingDeeper{Some marginal notes will look like this one and provide +%some additional information that you may find of interest.}% +\marginnote{Have a great suggestion for a marginal note? Pass it along.}% +Sometimes these are side comments that we wanted to make without interrupting the flow of the main text. Others provide teaching tips or caution about traps, pitfalls, and gotchas. +%\Caution{But warnings are set differently to make sure they catch your attention.}% +%These may describe more advanced features of the language or make suggestions +%about how to implement things in the classroom.
Some are warnings +%to help you avoid common pitfalls. Still others contain requests for feedback. +%\SuggestionBox{So, do you like having marginal notes in these +%notes?} + + +\subsection*{What's Ours Is Yours -- To a Point} + +This material is copyrighted by the authors under a Creative Commons Attribution 3.0 +Unported License. +You are free to \emph{Share} (to copy, distribute and transmit the work) and to \emph{Remix} +(to adapt the work) if you attribute our work. +More detailed information about the licensing is available at this web page: +\url{http://www.mosaic-web.org/go/teachingRlicense.html}. + + + +\DiggingDeeper{If you know \LaTeX\ as well as \R, then \pkg{knitr} +provides a nice solution for mixing the two. +We used this system to produce this book. We also use it +for our own research and to introduce upper-level students to +reproducible analysis methods. +For beginners, we introduce \pkg{knitr} with RMarkdown, +which produces PDF, HTML, or Word files using a simpler syntax.} + +\subsection*{Document Creation} + +This document was created on \today, using +\begin{itemize} +\item \pkg{knitr}, version \Sexpr{packageVersion("knitr")} +\item \pkg{mosaic}, version \Sexpr{packageVersion("mosaic")} +\item \pkg{mosaicData}, version \Sexpr{packageVersion("mosaicData")} +\item \Sexpr{R.version.string} +\end{itemize} + +Inevitably, each of these will be updated from time to time. +If you find that things look different on your computer, make sure that your +version of \R{} and your packages are up to date and check for a newer version +of this document. + +Kudos to Joseph Cappelleri for many useful comments on earlier drafts of these materials and to Margaret Chien for her work updating the examples to use \pkg{ggformula}. + + + +\chapter*{Project MOSAIC} + +This book is a product of +Project MOSAIC, a community of educators working to develop new ways to +introduce mathematics, statistics, computation, and modeling to students in +colleges and universities.
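[Editor's illustration for the package setup described in the front matter: the packages it lists (knitr, mosaic, mosaicData) are installed once per machine and then loaded in each session; packageVersion() reports the installed version that the \Sexpr{} calls above display. This is a minimal sketch, not part of the patch.]

```r
# One-time installation (per machine) of the packages used in these notes.
install.packages(c("knitr", "mosaic", "mosaicData"))

# Per-session loading; packageVersion() reports what is installed,
# which is what the \Sexpr{packageVersion(...)} calls above display.
library(mosaic)
packageVersion("mosaic")
```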
+ +\bigskip + +The goal of the MOSAIC project is to help share ideas and resources to +improve teaching, and to develop a curricular and assessment +infrastructure to support the dissemination and evaluation of these approaches. +Our goal is to provide a broader approach to quantitative studies that better +supports work in science and technology. +The project highlights and integrates +diverse aspects of quantitative work that students in science, +technology, and engineering will need in their professional lives, but which +are today usually taught in isolation, if at all. + +\vspace{.1in} + +In particular, we focus on: +\begin{description} + \item[Modeling] The ability to create, manipulate, and investigate useful and informative mathematical representations of real-world situations. + + \item[Statistics] The analysis of variability that draws on our ability to quantify uncertainty and to draw logical inferences from observations and experiments. + + \item[Computation] + The capacity to think algorithmically, to manage data on large scales, to visualize and interact with models, and to automate tasks for efficiency, accuracy, and reproducibility. + + \item[Calculus] + The traditional mathematical entry point for college and university students and a subject that still has the potential to provide important insights to today's students.
+ \end{description} + +Drawing on support from the US National Science Foundation (NSF DUE-0920350), +Project MOSAIC supports a number of initiatives to help achieve these goals, +including: +\begin{description} +\item +[Faculty development and training opportunities,] +such as the USCOTS 2011, USCOTS 2013, eCOTS 2014, eCOTS 2016, eCOTS 2018, and ICOTS 9 workshops on +\emph{Teaching Statistics Using \R\ and \RStudio}, our 2010 +Project MOSAIC kickoff workshop at the Institute for Mathematics +and its Applications, and our \emph{Modeling: Early and Often in Undergraduate Calculus} +AMS PREP workshops offered in 2012, 2013, and 2015. + +\item +[M-casts,] +a series of regularly scheduled webinars, delivered via the Internet, +that provide a forum for instructors to share their insights and innovations +and to develop collaborations to refine and develop them. +Recordings of M-casts are available +at the Project MOSAIC web site, \url{http://mosaic-web.org}. + +%\item[The development of a ``concept inventory" to support teaching modeling.] +%It is somewhat rare in today's curriculum for modeling to be taught. +%College and university catalogs are filled with descriptions of courses +%in statistics, computation, and calculus. There are many textbooks in +%these areas and most new faculty teaching statistics, computation, +%and calculus have a solid idea of what should be included. +%But modeling is different. It's generally recognized +%as important, but few if instructors have a clear view of the essential +%concepts. + +\item[The construction of syllabi and materials] +for courses that teach MOSAIC topics in a better integrated way. Such +courses and materials might be wholly new constructions, or they might be +incremental modifications of existing resources that draw on the +connections between the MOSAIC topics. +\end{description} + +More details can be found at \url{http://www.mosaic-web.org}. 
+We welcome and encourage your participation in all of these initiatives. + + + +\chapter*{Computational Statistics} + +There are at least two ways in which statistical software +can be introduced into a statistics course. In the first approach, the course +is taught essentially as it was before the introduction of statistical +software, but using a computer to speed up some of the calculations and +to prepare higher-quality graphical displays. Perhaps the size of the +data sets will also be increased. We will refer to this approach as +\term{statistical computation} +since the computer serves primarily as a computational +tool to replace pencil-and-paper calculations and drawing plots manually. + +In the second approach, more fundamental changes in the course result from the introduction of the computer. Some new topics are covered; some old topics are omitted. Some old topics are treated in very different ways, and perhaps at different points in the course. We will refer to this approach as \term{computational statistics} because the availability of computation is shaping how statistics is done and taught. Computational statistics is a key component of \term{data science}, defined as the ability to use data to answer questions and communicate those results. + +\FoodForThought{Students need to see aspects of computation and data science early and often +to develop deeper skills. Establishing precursors in introductory courses helps them get started.}% +In practice, most courses will incorporate elements of both +statistical computation and computational statistics, but the relative +proportions may differ dramatically from course to course. +Where on the spectrum a course lies will depend +on many factors including +the goals of the course, +the availability of technology for student use, +the perspective of the textbook used, +and the comfort level +of the instructor with both statistics and computation.
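[Editor's illustration of the computational-statistics approach described above: resampling-based inference in place of a formula-based interval. The sketch uses only base R; the data vector is hypothetical.]

```r
# Bootstrap percentile interval for a mean -- resampling-based inference
# using only base R. The data vector here is made up for illustration.
set.seed(123)
x <- c(3.1, 4.7, 2.2, 5.9, 4.4, 3.8, 6.0, 2.9, 4.1, 5.2)

# Resample the data with replacement many times, recording each mean.
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# An approximate 95% interval for the population mean.
quantile(boot_means, c(0.025, 0.975))
```

In a mosaic-flavored course the same idea is typically written with `do()` and `resample()`; the base-R version above simply makes the mechanism explicit.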
+ + +Among the various statistical software packages available, \R\ is becoming +increasingly popular. The recent addition of \RStudio\ has made \R\ both +more powerful and more accessible. +Because \R\ and \RStudio\ are free, they +have become widely +used in research and industry. Training in \R\ and \RStudio\ is often seen as an +important additional skill that a statistics course can develop. Furthermore, +an increasing number of instructors are using \R\ for their own statistical +work, so it is natural for them to use it in their teaching as well. +At the same time, the development of \R\ and of \RStudio\ (an optional +interface and integrated development environment for \R) are making it +easier and easier to get started with \R. + +%Nevertheless, those who are unfamiliar with \R\ or who have never used \R\ for teaching are understandably cautious about using it with students. If you are in that category, then this book is for you. Our goal is to reveal some of what we have learned teaching with \R\ and to make teaching statistics with \R\ as rewarding and easy as possible -- for both students and faculty. We will cover both technical aspects of \R\ and \RStudio\ (e.g., how do I get \R\ to do thus and such?) as well as some perspectives on how to use computation to teach statistics. The latter will be illustrated in \R\ but would be equally applicable with other statistical software. + +%Others have used \R\ in their courses, but have perhaps left the course feeling +%like there must have been better ways to do this or that topic. If that +%sounds more like you, then this book is for you, too. 
As we have been working +%on this book, we have also been developing the \pkg{mosaic} + +\FoodForThought{Information about the \pkg{mosaic} package, including vignettes demonstrating features and supplementary materials (such as this book), can be found at \url{https://cran.r-project.org/web/packages/mosaic}.} +We developed the \pkg{mosaic} +\R\ package (available on CRAN) to make certain aspects of statistical +computation and computational statistics simpler for beginners, without limiting their ability to +use more advanced features of the language. The \pkg{mosaic} package includes a modeling approach that uses the same general syntax to calculate descriptive statistics, create graphics, and fit linear models. + diff --git a/Compendium/Cover/frontice.docx b/Compendium/Cover/frontice.docx deleted file mode 100644 index 62d87a3..0000000 Binary files a/Compendium/Cover/frontice.docx and /dev/null differ diff --git a/Compendium/Cover/frontice.pdf b/Compendium/Cover/frontice.pdf deleted file mode 100644 index de1dda6..0000000 Binary files a/Compendium/Cover/frontice.pdf and /dev/null differ diff --git a/CoverImages/.DS_Store b/CoverImages/.DS_Store deleted file mode 100644 index 5008ddf..0000000 Binary files a/CoverImages/.DS_Store and /dev/null differ diff --git a/CoverImages/ISBN-9780983965831-StudentGuide.pdf b/CoverImages/ISBN-9780983965831-StudentGuide.pdf new file mode 100644 index 0000000..1524cdf Binary files /dev/null and b/CoverImages/ISBN-9780983965831-StudentGuide.pdf differ diff --git a/Functions/.gitignore b/Functions/.gitignore deleted file mode 100644 index db89344..0000000 --- a/Functions/.gitignore +++ /dev/null @@ -1,10 +0,0 @@ -*.aux -*.bbl -*.blg -*.log -*.notes -*.synctex.gz -*-concordance.tex -*.pdf -*.tex -figure diff --git a/Functions/CalculusForStats.Rnw b/Functions/CalculusForStats.Rnw deleted file mode 100644 index 32de343..0000000 --- a/Functions/CalculusForStats.Rnw +++ /dev/null @@ -1,142 +0,0 @@ -<>= -source('../include/setup.R')
-opts_chunk$set( fig.path="figure/CalculusForStats-fig-" ) -if (!exists("standAlone")) set_parent('../include/MainDocument.Rnw') -set.seed(123) -@ - - - - -\chapter{Making Connections to Calculus and Linear Algebra} - -The traditional introductory statistics course does not presume that students know any calculus or linear -algebra. But for courses where students have had (or will be learning) these mathematical tools, \R\ and \RStudio\ -provide the necessary computational tools. - -\section{Calculus in R} - -Most people familiar with \R\ assume that it provides no facilities for -the central operations of calculus: differentiation and integration. -While it is true that \R\ does not provide a comprehensive computer algebra system -\authNote{Should we site the bit of symbolic stuff \R\ does have?}% -capable of symbolic manipulations on a wide variety of functions and expressions, -it does provide most of the calculus tools needed for work in statistics, including - -\begin{itemize} - \item A way to define and evaluate functions, - especially functions of two or more variables. [\function{function}] - \item An ability to visualize functional relationships graphically. [\function{plotFun}] - \item A modeling strategy: how to approximate potentially complex - relationships with simple-to-understand ones. - \item A way to (numerically) calculate derivatives (as functions), especially partial derivatives, to understand - relationships and partial relationships. [\function{D}] - \item - A way to perform (numerical) integration. [\function{antiD}, \function{integrate}] - \item - A way to estimate the roots and extrema of functions. - [\function{uniroot}, \function{nlmax}, \function{nlmin}] -\end{itemize} - - -%What they don't need: -%\begin{itemize} -% \item Limits and formal definitions of continuity. -% \item Most symbolic algorithms, e.g. elaborate applications of the -% chain rule in differentiation or almost any symbolic integration. 
-%\end{itemize} - -%In fact, one can argue that this is the situation in calculus courses as well as in statistics courses. -%As evidence that this is not an eccentric view, we refer you to the -%MAA report on ``Curriculum Renewal across the First Two Years,'' which -%examined the relationship between mathematics and more than 20 -%``partner'' disciplines: ranging from physics to engineering to business to -%biology.\cite{MAA-CRAFTY} - - -In this chapter we illustrate how to perform the needed operations listed above -in the context of statistical applications. - - -\subsection{Defining Functions} -Functions that are described by algebraic formulas are generally easy to describe in \R. For example, -$f(x) = x (1-x)^2$ becomes -<>= -f <- function(x) { x * (1-x)^2 } -@ -New functions can be evaluated at one or several points just like built-in \R\ functions: -<<>>= -f(1) -f(-2:2) -@ -and can be plotted using \function{plotFun}. -<>= -plotFun(f(x) ~ x, xlim=range(-1,2)) -@ - -\subsection{Differentiation and Integration} - -Perhaps we would like to use our function $f$ as the kernel of a distribution on in the interval -$[0,1]$. First we determine the scaling constant involved. We can use either \function{integrate} -from the \pkg{stats} package or \function{antiD} from the \pkg{mosaic} pacakge. -<<>>= -integrate(f, 0, 1) -integrate(f, 0, 1)$value # just the value -F <- antiD(f, from=0) # returns a function -F(1) # evaluate at 1 -@ -The \function{fractions} function in the \pkg{MASS} package can help identify whether -decimals are approximating simple fractions. -<<>>= -fractions(integrate(f, 0, 1)$value) -fractions(F(1)) -@ -% This shows that the scaling constant is $\ Sexpr{1/F(1)}$. 
-Let's redefine $f$ so that it is a pdf: -\authNote{rjp: removed inline S expression -- bug in knitr?}% -<<>>= -f <- function(x) { 12 * x * (1-x)^2 * (x >= 0) * (x <= 1) } -plotFun(f(x) ~ x, xlim=c(-0.5,1.5), ylim = c(-.5,2.5)) -@ - - -\InstructorNote{There are different styles for dealing with parameters.} - - -Many important functions do not have integrals that can be expressed -in terms of a simple formula. For instance, the normal pdf -$$ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( \frac{(x - \mu)^2}{2 \sigma^2}\right) .$$ - -We could, of course, write this out as the equivalent formula in \R, -but the function is so important it already has a name in \R: \function{dnorm}. - -Let's integrate it, setting $\mu=3$ and $\sigma=1.5$: -<<>>= -f = function(x){dnorm(x, mean=3, sd=1.5) } -@ -This $f$ is a particular member of the family of normal -distributions. Here is its cumulative function: -<>= -fcumulative = antiD(f,-Inf) -curve(fcumulative, -2,10, lwd=4) # by integration -curve( pnorm(x, mean=3, sd=1.5), add=TRUE, col="red") # the built-in -@ - -There's little point in computing this integral, however, except to -show that it matches the built-in \function{pnorm} function. - -One of the advantages to teaching integration and differentiation in a -way that doesn't depend on constructing formulas, is that you can use -functions that don't have simple formulas for modeling. For example, -you can use functions generated by splining through data points. - -\authNote{Show a spline example.} - - -%\section{Discrete} - -\section{Linear Algebra} - -Showing how to introduce Linear Algebra by taking linear combinations -of functions to fit a set of data. 
- diff --git a/Functions/Master/Master-Functions.Rnw b/Functions/Master/Master-Functions.Rnw deleted file mode 100644 index 1cbb704..0000000 --- a/Functions/Master/Master-Functions.Rnw +++ /dev/null @@ -1,13 +0,0 @@ -% All pre-amble stuff should go into ../include/MainDocument.Rnw -\title{Formulas and Functions} -\author{Randall Pruim and Nicholas Horton and Daniel Kaplan} -\date{DRAFT: \today} -\Sexpr{set_parent('../../include/MainDocument.Rnw')} % All the latex pre-amble for the book -\maketitle - -\tableofcontents - -\newpage - -\import{../}{CalculusForStats} - diff --git a/Functions/Outline-Functions.Rmd b/Functions/Outline-Functions.Rmd deleted file mode 100644 index 118b98a..0000000 --- a/Functions/Outline-Functions.Rmd +++ /dev/null @@ -1 +0,0 @@ -## Functions and Formulas: Outline diff --git a/ICOTS/Workshop2013/abstract.txt b/ICOTS/Workshop2013/abstract.txt deleted file mode 100644 index 303d545..0000000 --- a/ICOTS/Workshop2013/abstract.txt +++ /dev/null @@ -1,14 +0,0 @@ -Teaching Statistics with R and RStudio - -This workshop will introduce participants to teaching applied statistics -courses using computing in an integrated way. The presenters have been using R -to teach statistics to undergraduates at all levels for the last decade and -will share their approach and some of their favorite examples. Topics will -include workflow in the RStudio environment, providing novices with a powerful -but manageable set of tools, data visualization, resampling and randomization -methods, and how to emphasize modeling throughout the curriculum. Much of this -will be facilitated using the mosaic package. - -The workshop is designed to be accessible to those with little or no experience teaching -with R, and will provide participants with skills, examples, and resources that -they can use in their own teaching. 
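[Editor's illustration of the claim in the front matter that the mosaic package uses one general formula syntax for descriptive statistics, graphics, and model fitting. A minimal sketch, assuming mosaic and its ggformula dependency are installed; the built-in mtcars data stand in for course data.]

```r
# The same "goal ~ condition" formula shape drives all three tasks
# once mosaic is loaded (mosaic and ggformula assumed installed).
library(mosaic)

mean(mpg ~ cyl, data = mtcars)      # descriptive statistics, by group
gf_point(mpg ~ wt, data = mtcars)   # graphics, via ggformula
lm(mpg ~ wt, data = mtcars)         # model fitting, same formula shape
```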
diff --git a/Internet/.gitignore b/Internet/.gitignore deleted file mode 100644 index db89344..0000000 --- a/Internet/.gitignore +++ /dev/null @@ -1,10 +0,0 @@ -*.aux -*.bbl -*.blg -*.log -*.notes -*.synctex.gz -*-concordance.tex -*.pdf -*.tex -figure diff --git a/Internet/Internet.Rnw b/Internet/Internet.Rnw deleted file mode 100644 index 4603961..0000000 --- a/Internet/Internet.Rnw +++ /dev/null @@ -1,785 +0,0 @@ -<>= -source('../include/setup.R') -opts_chunk$set( fig.path="figure/Internet-fig-" ) -if (!exists("standAlone")) set_parent('../include/MainDocument.Rnw') -set.seed(123) -@ - - -\chapter{Taking Advantage of the Internet} - -The Internet provides a wealth of data spanning the world, access -to sophisticated statistical computing, and a practical means for -you to communicate with your own students. In this chapter, we'll -illustrate some mundane ways for you to distribute and share data and -software with your students, web-based interfaces for statistical -computing, as well as tools for ``scraping'' data -from the Internet using application program interfaces (API's) or -through XML (eXtensible Markup Language). - -We draw your attention particularly to -provocative papers by Gould \cite{Goul:2010} (about the importance of broadening the type of data and questions which students encounter -in their first courses) -and Nolan and Temple Lang -\cite{nola:temp:2010} (highlighting -the increasingly important role of computing in modern statistics). -\FoodForThought{The wealth of data accessible to students on the internet -continues to increase at what feels like an exponential rate.} - -\section{Sharing With and Among Your Students} -\label{sec:distributing-data} - -Instructors often have their own data sets to illustrate -points of statistical interest or to make a particular connection with -a class. 
Sometimes you may want your class as a whole to construct a -data set, perhaps by filling in a survey or by contributing -their own small bit of data to a class collection. Students may be -working on projects in small groups; it's nice to have tools to -support such work so that all members of the group have access to the -data and can contribute to a written report. - -There are now many technologies for supporting such sharing. For the -sake of simplicity, we will emphasize three that we have found -particularly useful both in teaching statistics and in our -professional collaborative work. These are: -\begin{itemize} -\item A web site with minimal overhead, such as provided by Dropbox. -\item The services of Google Docs. -\item A web-based \RStudio\ server for \R. -\end{itemize} -The first two are already widely used in university environments and -are readily accessible simply by setting up accounts. Setting up an -\RStudio\ web server requires some IT support, but is well within the -range of skills found in IT offices and even among some individual faculty. - -\subsection{Your Own Web Site} - -You may already have a web site. We have in mind a place where you -can place files and have them accessed directly from the Internet. -For sharing data, it's best if this site is public, that is, it does not require a login. -That rules out most ``course support'' systems such as Moodle or -Blackboard. -\FoodForThought{Our discussion of Dropbox is primarily for those who do -not already know how to do this other ways.}% - -The Dropbox service for storing files in the ``cloud'' provides a very -convenient way to distribute files over the web. (Go to -\texttt{dropbox.com} for information and to sign up for a free account.) -Dropbox is routinely used to provide automated backup and coordinated -file access on multiple computers. But the Dropbox service also -provides a {\sc Public} directory. Any files that you place in that -directory can be accessed directly by a URL. 
- -To illustrate, suppose you wish to share some data set with your -students. You've constructed this data set in a spreadsheet and -stored it as a CSV file, let's call it ``example-A.csv''. Move this -file into the {\sc Public} directory under Dropbox --- on most -computers Dropbox arranges things so that its directories appear -exactly like ordinary directories and you'll use the ordinary familiar -file management techniques as in Figure \ref{fig:dropbox1}. -\begin{figure} -\begin{center} -\includegraphics[width=3.5in]{images/dropbox1.png} -\end{center} -\caption{\label{fig:dropbox1} Dragging a CSV file to a Dropbox Public directory} -\end{figure} - -Dropbox also makes it straightforward to construct the web-location -identifying URL for any file by using mouse-based menu commands to -place the URL into the clipboard, whence it can be copied to your -course-support software system or any other place for distribution to -students. For a CSV file, reading the contents of the file into \R\ -can be done with the \function{read.csv} function, by giving it the -quoted URL: -<>= -a <- read.csv("http://dl.dropbox.com/u/5098197/USCOTS2011/ExampleA.csv") -@ -\InstructorNote{The history feature in \RStudio\ can be used to -re-run this command in future sessions} - -\begin{figure} -\begin{center} -\includegraphics[width=4.5in]{images/dropbox2.png} -\end{center} -\caption{\label{fig:dropbox2}Getting the URL of a file in a Dropbox Public directory} -\end{figure} - -This technique makes it easy to distribute data with little -advance preparation. It's fast enough to do in the middle of a -class: the CSV file is available to your students (after a brief lag -while Dropbox synchronizes). -It can even be edited by you (but not by your students). - -The same technique can be applied to all sorts of files: for example, -\R\ workspaces or even \R\ scripts. 
Of course, your students need to -use the appropriate \R\ command: \function{load()} for a workspace or -\function{source()} for a script. - -Many instructors will find it useful to -create a file with your course-specific \R\ -scripts, adding on to it and modifying it as the course progresses. -This allows you to distribute all sorts of special-purpose functions, -letting you distribute new \R\ material to your students. For -instance, that brilliant new ``manipulate'' idea you had at 2am can be -programmed up and put in place for your students to use the next -morning in class. Then as you identify bugs and refine the program, -you can make the updated software immediately available to your students. - -For example, in the next section of this book we will discuss reading -directly from Google Spreadsheets. It happens that we wanted to try a -new technique but were not sure that it was worth including in the -\texttt{mosaic} package. So, we need another way to distribute it to -you. Use this statement: -<>= -source("http://dl.dropbox.com/u/5098197/USCOTS2011/USCOTS2011.R") -@ -Among other things, the operator \texttt{readGoogleCSV()} is defined -in the script that gets sourced in by that command. Again, you can -edit the file directly on your computer and have the results instantly -available (subject only to the several second latency of Dropbox) to -your students. Of course, they will have to re-issue the -\function{source} command to re-read the script. - -If privacy is a concern, for instance if you want the data available -only to your students, you can effectively accomplish this -by giving files names known only to your students, e.g., -``Example-A78r423.csv''. 
- -\Caution{\emph{Security through Obscurity} of this sort will -not generally satisfy institutional data protection regulations nor -professional ethical requirements} - - - -\subsection{GoogleDocs} - -The Dropbox technique is excellent for broadcasting: taking files you -create and distributing them in a read-only fashion to your students. -But when you want two-way or multi-way -sharing of files, other techniques are called for, such as provided by -the GoogleDocs service. - -GoogleDocs allows students and instructors to create various forms of -documents, including reports, presentations, and spreadsheets. (In -addition to creating documents {\em de novo}, Google will also convert -existing documents in a variety of formats.) - -Once on the GoogleDocs system, the documents can be edited -{\em simultaneously} by multiple users in different locations. They -can be shared with individuals or groups and published for -unrestricted viewing and even editing. - -For teaching, this has a variety of uses: -\begin{itemize} - \item Students working on group projects can all simultaneously have - access to the report as it is being written and to data that is - being assembled by the group. - \item The entire class can be given access to a data set, both for - reading and for writing. - \item The Google Forms system can be used to construct surveys, the - responses to which automatically populate a spreadsheet that can - be read by the survey creators. - \item Students can ``hand in'' reports and data sets by copying a link - into a course support system such as Moodle or Blackboard, or - emailing the link. - \item The instructor can insert comments and/or corrections directly - into the document. -\end{itemize} - -An effective technique for organizing student work and ensuring -that the instructor (and other graders) have access to it, is to -create a separate Google directory for each student in your class -(Dropbox can also be used in this manner). 
-Set the permission on this directory to share it with the
-student. Anything she or he drops into the directory is automatically
-available to the instructor. The student can also share with specific
-other students (e.g., members of a project group).
-
-\begin{example}
-One exercise for students starting out in a statistics course is to
-collect data to find out whether the ``close door'' button on an
-elevator has any effect. This is an opportunity to introduce simple
-ideas of experimental design. But it's also a chance to teach about
-the organization of data.
-
-Have your students, as individuals or small groups, study a particular
-elevator, organize their data into a spreadsheet, and hand in their
-individual spreadsheet. Then review the spreadsheets in class. You
-will likely find that many groups did not clearly understand the
-distinction between cases and variables, or coded their data in
-ambiguous or inconsistent ways.
-
-Work with the class to establish a consistent scheme for the variables
-and their coding, e.g., a variable \VN{ButtonPress} with levels
-``Yes'' and ``No'', a variable \VN{Time} giving the time in seconds
-from a fiducial time (e.g., when the button was pressed or would have
-been pressed), and variables \VN{ElevatorLocation}
-and \VN{GroupName}. Create a spreadsheet
-with these variables and a few cases filled in. Share it with the class.
-
-Have each of your students add his or her own data to the class data
-set. Although this is a trivial task, having to translate their
-individual data into a common format strongly reinforces the
-importance of a consistent measurement and coding system for recording
-data.
-\end{example}
-
-Once you have a spreadsheet file in GoogleDocs, you will want to open
-it in \R. Of course, it's possible to export it as a CSV file, then
-open it using the CSV tools in \R, such as \function{read.csv}.
-But there are easier ways that let you work with the data ``live.'' - -\paragraph{In the web-server version of \RStudio,} described below, you can - use a menu item to locate and load your spreadsheet. - -\begin{center} - \includegraphics[width=3in]{images/google-spreadsheet1.png} -\end{center} - -\paragraph{If you are using other \R\ interfaces,} you must first use the Google - facilities for publishing documents. -\begin{enumerate} - \item From within the document, use the ``Share'' dropdown menu and - choose ``Publish as a Web Page.'' - \item Press the ``Start Publishing'' button in the ``Publish to the - web'' dialog box. (See figure \ref{fig:publish-google}.) - \item In that dialog box, go to ``Get a link to the published - data.'' Choose the CSV format and copy out the link that's - provided. You can then publish that link on your web site, or via - course-support software. Only people with the link can see the - document, so it remains effectively private to outsiders. -\end{enumerate} - - -\begin{figure} -\begin{center} - \includegraphics[width=4.5in]{images/publishing-google1.png} -\end{center} -\caption{\label{fig:publish-google}Publishing a Google Spreadsheet so that it can be read - directly into \R.} -\end{figure} - -It turns out that communicating with GoogleDocs requires facilities -that are not present in the base version of \R, but are available -through the \texttt{RCurl} package. In order to make these readily -available to students, we have created a function that takes a quoted -string with the Google-published URL and reads the corresponding file -into a data frame: -<>= -elev <- readGoogleCSV( -"https://spreadsheets.google.com/spreadsheet/pub?hl=en&hl=en&key=0Am13enSalO74dEVzMGJSMU5TbTc2eWlWakppQlpjcGc&single=TRUE&gid=0&output=csv") -head(elev) -@ - -Of course, you'd never want your students to type that URL by hand; -you should provide it in a copy-able form on a web site or within a -course support system. 
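One way to spare students the long URL entirely is to wrap it in a helper defined in the course script they already \function{source()}. A sketch (the key below is a placeholder, and \texttt{classData()} is a hypothetical name of ours):

```r
# Hypothetical wrapper (ours, not part of mosaic): hide the long
# published-spreadsheet URL behind a single function for students.
classData <- function() {
  url <- paste0("https://spreadsheets.google.com/spreadsheet/pub?",
                "key=YOURKEYHERE&single=TRUE&gid=0&output=csv")
  read.csv(url)  # or readGoogleCSV(url) when RCurl is needed for https
}
```

Students then type only \texttt{elev <- classData()}, and the instructor can repoint the URL without redistributing anything.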
- -Note that the \function{readGoogleCSV} function is not part of the -\texttt{mosaic} package. As described previously, we make it -available via an \R\ source file that can be read into the current -session of \R\ using the \function{source} command: -<>= -source("http://dl.dropbox.com/u/5098197/USCOTS2011/USCOTS2011.R") -@ - - - - -\subsection{The \RStudio\ Web Server} - -\RStudio\ is available as a desktop application that provides a -considerately designed interface to the standard \R\ software that you -can install on individual computers. - -But there is another version of \RStudio\ available, one that takes -the form of a web server. There are some substantial advantages to -using the web-server version. -\begin{itemize} -\item For the user, no installation is required beyond a standard web browser. -\item Sessions are continued indefinitely; you can come back to your - work exactly where you left it. -\item A session started on one computer can be continued on another - computer. So a student can move seamlessly from the classroom to - the dorm to a friend's computer. -\item The web-server system provides facilities for direct access to - GoogleDocs. -\end{itemize} - - -As \RStudio\ continues to be developed, we anticipate facilities being -added that will enhance even more the ability to teach with R: - -\Caution{These are anticipated future features.} - - -\begin{itemize} -\item -The ability to create URLs that launch \RStudio\ and read in a data set all in a single -click. -\item The ability to share sessions simultaneously, so that more than - one person can be giving commands. This will be much like Google Docs, but with the - \R\ console as the document. Imagine being able to start a - session, then turn it over to a student in your classroom to give - the commands, with you being able to make corrections as needed. -\item The ability to clone sessions and send them off to others. 
For
-instance, you could set up a problem and then pass it along to your
-students for them to work on.
-\end{itemize}
-
-But even with the present system, the web-based \RStudio\ version
-allows you to work with students effectively during office hours. You
-can keep your own version of \RStudio\ running in your usual browser, but give a visiting
-student a window in a new browser: Firefox, Chrome, Safari, Internet
-Explorer, etc. Each new browser is effectively a new machine, so your
-student can log in securely to his or her own account.
-
-
-
-\section{Data Mining Activities}
-
-\begin{comment}
-We end this chapter with several examples that do data mining via the Internet.
-Some of these are mere glimpses into what might be possible as tools for
-accessing this kind of data become more prevalent and easier to use.
-\end{comment}
-\subsection{What percentage of Earth is Covered with Water?}
-\label{sec:googleMap}
-We can estimate the proportion of the world covered with water by randomly
-sampling points on the globe and inspecting them using GoogleMaps.
-
-First, let's do a sample size computation. Suppose we want to
-estimate (at the 95\% confidence level) this proportion within $\pm 5$\%.
-There are several ways to estimate the necessary sample size, including
-algebraically solving
-\[
-(1.96) \sqrt{ \hat p (1-\hat p) /n} = 0.05
-\]
-for $n$ given some estimated value of $\hat p$. The \function{uniroot()} function
-can solve this sort of thing numerically. Here we take an approach
-that looks at a table of values of $n$, $\hat p$, and the resulting margin of error.
-<>=
-n <- seq(50,500, by=50)
-p.hat <- seq(.5, .9, by=0.10)
-margin_of_error <- function(n, p, conf.level=.95) {
-  # qnorm() of the lower tail is negative, so negate to get a positive margin
-  -qnorm( (1-conf.level)/2 ) * sqrt( p * (1-p) / n )
-}
-# calculate margin of error for all combos of n and p.hat
-tbl <- outer(n, p.hat, margin_of_error)
-colnames(tbl) <- p.hat
-rownames(tbl) <- n
-tbl
-@
-From this it appears that a sample size of approximately 300--400 will get
-us the accuracy we desire. A class of students can easily generate
-this much data in a matter of minutes if each student inspects 10--20 maps.
-The example below assumes a sample size of 10 locations per student.
-This can be adjusted depending on the number of students and the desired
-margin of error.
-
-\begin{enumerate}
-\item Generate 10 random locations.
-
-<>=
-positions <- rgeo(10); positions
-@
-
-\item
-Open a GoogleMap centered at each position.
-
-<>=
-googleMap(pos=positions, mark=TRUE)
-@
-You may need to turn off pop-up blocking for this to work smoothly.
-
-\item
-For each map, record whether the center is located in water or on land. The option \option{mark=TRUE}
-is used to place a marker at the center of the map (this is helpful for locations that are close to
-the coast).
-\begin{center}
-\includegraphics[width=.8\textwidth]{images/google-water1}
-\end{center}
-You can zoom in or out to get a better look.
-\begin{center}
-\includegraphics[width=.8\textwidth]{images/google-water2}
-\end{center}
-
-
-\item
-Record your data in a GoogleForm at
-
-\begin{center}
-\url{http://mosaic-web.org/uscots2011/google-water.html}
-%\url{https://spreadsheets.google.com/viewform?formkey=dGREcUR6YjRLSWFTWVpNNXA5ZUZ1TXc6MQ}
-
-\includegraphics[width=.4\textwidth]{images/googleForm-water}
-\end{center}
-
-For the latitude and longitude information, simply copy and paste the output of
-<>=
-positions
-@
-\Caution{This sort of copy-and-paste operation works better in some
-browsers (Firefox) than in others (Safari).}%
-\item
-After importing the data from Google, it is simple to sum the counts across the class.
-
-<>=
-googleData <- data.frame(Water=215, Land=85)
-@
-
-<>=
-sum(googleData$Water)
-sum(googleData$Land)
-@
-
-Then use your favorite method of analysis, perhaps \function{binom.test()}.
-
-<>=
-interval(binom.test(215, 300))  # numbers of successes and trials
-@
-\end{enumerate}
-
-
-\subsection{Roadless America}
-
-The \function{rgeo()} function can also sample within a latitude--longitude ``rectangle''.
-This allows us to sample subsets of the globe. In this activity we will estimate
-the proportion of the continental United States that is within 1 mile of a road.
-
-\begin{enumerate}
-\item
-Generate a random sample of locations in a box containing the continental United States.
-Some of these points may be in Canada, Mexico, an ocean, or a major lake. These
-will be discarded from our sample before making our estimate.
-<>=
-positions <- rgeo(10, lonlim=c(-125,-65), latlim=c(25,50)); positions
-@
-
-\item
-Open a GoogleMap centered at each position. This time we'll zoom in a bit and add
-a circle of radius 1 mile to our map.
-
-<>=
-googleMap(pos=positions, mark=TRUE, zoom=12, radius=1)
-@
-
-
-\begin{center}
-\includegraphics[width=.8\textwidth]{images/google-roadless}
-\end{center}
-You may need to turn off pop-up blocking for this to work smoothly.
-\item
-For each map, record whether the center is close (to a road), far (from a road), water, or foreign.
-You may need to zoom in or out a bit to figure this out.
-
-\end{enumerate}
-
-\subsection{Variations on the Google Maps theme}
-
-There are many other quantities one could estimate using these tools. For example:
-\begin{enumerate}
-\item
-What proportion of your home state is within $m$ miles of a lake? (The choice of $m$ may depend upon
-your state of interest.)
-\item
-Use two-proportion procedures or chi-squared tests to compare states or continents.
-Do all continents have roughly the same proportion of land within $m$ miles of water (for some $m$)?
-Are Utah and Arizona equally roadless?
-
-\item
-In more advanced classes: What is the average distance to the nearest lake (in some region)?
-By using concentric circles, one could estimate this from discretized data indicating, for example,
-whether the nearest lake is within 1/2 mile, between 1/2 mile and 1 mile, between 1 mile and 2 miles,
-between 2 miles and 4 miles, between 4 miles and 10 miles, or more than 10 miles away. It may be
-interesting to discuss what sort of model should be used for distances from random locations to lakes.
-(It probably isn't normally distributed.)
-\end{enumerate}
-
-\subsection{Zillow}
-
-\authNote{NH to work with Duncan about expanding the package and its other API}
-
-Zillow.com is an online real estate database that can be used to estimate
-property values using tax records, sales data, and comparable homes.
-
-\centerline{\includegraphics[width=3.8in]{images/zillow1.png}}
-
-The folks who run the site have made an application programming interface (API)
-that specifies
-how software programs can interface with their system. Duncan Temple Lang has
-crafted a package in R that talks to Zillow.
-This can be used to dynamically generate datasets for use in courses, after
-you (and/or your students) generate a \VN{zillowId} for use with the system.
-(Danny Kaplan has used {\tt cars.com} to similar ends).
-
-\InstructorNote{While this is a cool interface, students tend to be less interested
-in house prices than their instructors!}
-
-In this section, we describe how to use Zillow to generate and analyze a
-dataset of properties comparable to an arbitrary house of interest.
-
-The first step is to create a Zillow account (click on \verb!Register! on the
-top right of the page at \verb!zillow.com!). You can set up an account or register
-using Facebook.
-\SuggestionBox{\pkg{Zillow} is new to the authors and we are still looking for
-the ``killer activity'' using \pkg{Zillow}. We wonder if it is possible, for example,
-to sample uniformly among all houses in a city or zip code.}%
-
-Once you have the account, log in, then click on \verb!My Zillow! at the top right.
-This should display your profile (in this case, for a user named \verb!SC_z!).
-
-\centerline{\includegraphics{images/zillow_profile.pdf}}
-
-Next,
-open the page: \url{http://www.zillow.com/webservice/Registration.htm}. This
-is the application programming interface (API) request, which requires more information
-if you are a real estate professional. Note that there are limits on the use
-of these data, though at first glance the terms do not appear to preclude use for statistics
-activities and data collection. An overview of the API and terms of use can be found
-at \url{http://www.zillow.com/howto/api/APIOverview.htm}.
-
-\centerline{\includegraphics[width=4.6in]{images/zillow_api.pdf}}
-
-You should receive information about your Zillow ID (a character string
-of letters and numbers).
-
-Once you've set up your Zillow account and obtained your Zillow ID,
-the next step is to install the \pkg{Zillow} package.
This package is
-not on CRAN, but can be obtained from Omegahat (a different repository from CRAN) using
-\authNote{Groan: this has moved again}
-<>=
-install.packages("RCurl")
-install.packages("Zillow", repos="http://www.omegahat.org/R", type="source",
-  dependencies="Depends")
-@
-
-Next, you should initialize your \VN{zillowID} to the value that you
-received when you registered with {\tt Zillow.com}.
-<>=
-zillowId <- "set_to_your_zillowId"
-@
-<>=
-zillowId <- "X1-ZWz1bvi5ru1gqz_4srxq"  # this is Nick's, please don't share!
-@
-
-This allows you to make calls to functions such as \function{zestimate()} (which
-searches for information about a particular property) and
-\function{getComps()} (which facilitates finding a set of comparable properties).
-Here we find information about an arbitrary house in California, as well as
-comparable properties.
-<>=
-require(Zillow)
-est <- zestimate("1280 Monterey Avenue", "94707", zillowId)
-est
-comps <- getComps(rownames(est), zillowId, count=20)
-rownames(est)
-names(comps)
-@
-XX fix this
-<>=
-table(comps$bathrooms)
-perctable(comps$bathrooms)
-table(comps$bedrooms)
-perctable(comps$bedrooms)
-favstats(comps$finishedSqFt)
-@
-We can compare numerical summaries of the size of the house for houses with different
-numbers of bedrooms:
-
-XX AND THIS
-<>=
-require(Hmisc)
-summary(finishedSqFt ~ bedrooms, data=comps, fun=favstats)
-@
-We can look at the distribution of the Zillow price lower bound, upper bound, as well as assessed
-(tax) value.
-\InstructorNote{This syntax is somewhat dense, since the \function{bwplot()}
-function is expecting a data frame, not 3 vectors}
-<>=
-bwplot(
-  rep(c("Low", "Assessed", "High"), each=nrow(comps)) ~ c(low, taxAssessment, high),
-  data=comps, horizontal=TRUE, xlab='value ($)')
-@
-
-As an alternative to the code above, we could first build a data frame and then
-use that for our plot.
-\InstructorNote{The \pkg{reshape} package also provides support for
-restructuring datasets}
-<>=
-zillowData <- data.frame(
-  value = with(comps, c( low, taxAssessment, high )),
-  type = rep(c("Low", "Assessed", "High"), each=nrow(comps))
-  )
-bwplot( value ~ type, zillowData)
-@
-
-It's interesting that for these properties, assessed values tend to be lower
-than both the lower and upper Zillow estimates.
-We could explore whether this is true in California more generally.
-
-It's possible to plot the results of our comparable properties, which yields a scatterplot
-of price by square feet (with the number of bedrooms as well as the low and high range) as
-well as a scatterplot of amount/finished square feet vs.\ log size in square feet.
-<>=
-plot(comps)
-@
-
-Several aspects of this activity are worth noting:
-\begin{enumerate}
-\item There is some startup cost for instructors and students (since each user will need
-their own ZillowID\footnote{By default, the number of calls per day to the API is limited to 1000,
-which could easily be exceeded in a lab setting if, contrary to the terms of use, the Zillow ID
-were to be shared.}).
-\item Once set up, the calls to the Zillow package are very straightforward, and provide
-immediate access to a wealth of interesting data.
-\item This could be used as an activity to provide descriptive analysis of comparable
-properties, or in a more elaborate manner to compare properties in different cities or
-areas.
-\item Since the latitude and longitude of the comparable properties are returned, users
-can generate maps using the mechanisms described in section \ref{sec:googleMap}.
-\end{enumerate}
-
-\subsection{Twitter}
-\label{sec:twitter}
-
-\authNote{NH to reimplement the proposed interface}
-
-Twitter (\url{twitter.com}) is a social networking and microblogging service, where users
-(estimated at 200 million as of May, 2011) can send and read posts of up to 140 characters.
-Approximately 65 million ``tweets'' are posted per day.
-
-The \pkg{twitteR} package (due to Jeff Gentry) implements a Twitter client for R.
-This can be used to generate data for student projects, class assignments, or data mining.
-
-\Caution{Pulling live data from Twitter may generate some unsavory content.}
-
-The package provides a number of interface functions to connect with the service (see
-Table \ref{tab:twitter} for details).
-\begin{table}
-\begin{center}
-\caption{Functions available within the \pkg{twitteR} package}
-\label{tab:twitter}
-\begin{tabular}{|l|l|} \hline
-Function & Description \\ \hline
-\function{getUser} & returns an object of class {\tt user} \\
-\function{userFriends} & returns a list of class {\tt user} \\
-\function{userFollowers} & returns a list of class {\tt user} \\
-\function{searchTwitter} & returns a list of class {\tt status} \\
-\function{Rtweets} & searches for {\tt \#rstats} \\
-\function{showstatus} & takes a numeric ID of a tweet and returns it \\
-\function{publicTimeline} & current snapshot of the public timeline \\
-\function{userTimeline} & current snapshot of the user's public timeline \\ \hline
-\end{tabular}
-\end{center}
-\end{table}
-
-
-Some examples may help to better understand the interface. Let's start by seeing what
-the U.S. Census Bureau has been up to recently.
-<>=
-require(twitteR)
-census <- getUser("uscensusbureau")
-census$getName()
-census$getDescription()
-census$getUrl()
-census$getStatusesCount()
-census$getCreated()
-census$getFollowersCount()
-@
-That's a lot of followers (but then again, the census is a big job).
-Detailed reports of the last set of Twitter status updates can be returned.
-Let's take a look at what William Shatner has been up to.
-<>=
-shatner.tweets <- userTimeline("williamshatner", n=5)
-@
-By default, the tweets don't print very nicely. So let's define our
-own function to extract the text from a tweet and format it more
-nicely.
-<>=
-printTweet <- function(x) paste( strwrap(x$getText()))
-printTweet(shatner.tweets[[1]])
-@
-Detailed information can be found for individual tweets.
-
-<>=
-shatner.tweets[[1]]$getCreated()
-shatner.tweets[[1]]$getId()
-@
-\DiggingDeeper{Section \ref{sec:datastruct} provides details on accessing lists.}
-
-Using \function{sapply()}, we can inspect all of the tweets at once.
-<>=
-sapply(shatner.tweets, printTweet)
-@
-\DiggingDeeper{The \function{sapply()} function applies a function to each element of a list
-or vector. It is one of several ``apply'' functions in \R, including
-\function{apply},
-\function{sapply},
-\function{lapply}, and
-\function{tapply}.
-The \pkg{mosaic} package even includes
-\function{dfapply} for application to data frames.}
-
-Alternatively, we can download information about how active Census Bureau followers are
-in terms of posting their own status updates.
-
-<>=
-userFollowers <- function(user) {
-  return(user$getUserFollowers())
-}
-census.followers <- userFollowers(census)
-followers.ntweets <- sapply(census.followers, function(x) x$getStatusesCount())
-favstats(followers.ntweets)
-census.followers[[which.max(followers.ntweets)]]
-@
-That's probably a bit much for students to swallow. Let's write a function to hide
-some of the gory details from the students.
-<>=
-CountTweets <- function ( users ) {
-  return( sapply( users, function(x) x$getStatusesCount()) )
-}
-@
-Now we can count tweets like this.
-<>=
-sort(CountTweets(census.followers))
-@
-
-
-We can also see what's happening on the \R\ front:
-<>=
-R.tweets <- Rtweets(n=5)  # short for searchTwitter("#rstats", n=5)
-sapply(R.tweets, printTweet)
-@
-
-
-Finally, we can see who has been tweeting the most at USCOTS:
-<>=
-sort(table(sapply(searchTwitter("#uscots11"), function(x) x$getScreenName())),
-  decreasing=TRUE)
-@
-\InstructorNote{Support functions could be written to simplify this interface for
-student use.
Just another thing for us to do in our spare time...}
-
-Further support for posting tweets requires use of the Twitter API (with appropriate
-authentication). More details are provided in the \pkg{twitteR} vignette file.
-
-\subsection{Other ideas}
-
-We've outlined several approaches that efficiently scrape data from the web.
-But there are lots of other examples that may be worth exploring (and this is clearly
-a growth area). These include:
-\begin{description}
-\item[\pkg{RNYTimes}] interface to several of the \emph{New York Times} web services
-for searching articles, meta-data, user-generated content and best seller lists
-(\url{http://www.omegahat.org/RNYTimes})
-\item[\pkg{Rflickr}] interface to the Flickr photo sharing service
-(\url{http://www.omegahat.org/Rflickr})
-\item[\pkg{RGoogleDocs}] interface to allow listing documents on Google Docs
-along with details, downloading contents, and uploading files
-(\url{http://www.omegahat.org/RGoogleDocs})
-\item[\pkg{RLastFM}] interface to the \verb!last.fm!
music recommendation site (\url{http://cran.r-project.org/web/packages/RLastFM}) -\end{description} - diff --git a/Internet/Master/Master-Internet.Rnw b/Internet/Master/Master-Internet.Rnw deleted file mode 100644 index e063df8..0000000 --- a/Internet/Master/Master-Internet.Rnw +++ /dev/null @@ -1,13 +0,0 @@ -% All pre-amble stuff should go into ../include/MainDocument.Rnw -\title{Taking Advantage of the Internet} -\author{Randall Pruim and Nicholas Horton and Daniel Kaplan} -\date{DRAFT: \today} -\Sexpr{set_parent('../../include/MainDocument.Rnw')} % All the latex pre-amble for the book -\maketitle - -\tableofcontents - -\newpage - -\import{../}{Internet} - diff --git a/Internet/Outline-Internet.Rmd b/Internet/Outline-Internet.Rmd deleted file mode 100644 index ca24d0c..0000000 --- a/Internet/Outline-Internet.Rmd +++ /dev/null @@ -1 +0,0 @@ -## Using Internet Services: Outline diff --git a/Internet/images/dropbox1.png b/Internet/images/dropbox1.png deleted file mode 100644 index 4a3bfa4..0000000 Binary files a/Internet/images/dropbox1.png and /dev/null differ diff --git a/Internet/images/dropbox2.png b/Internet/images/dropbox2.png deleted file mode 100644 index 01ba522..0000000 Binary files a/Internet/images/dropbox2.png and /dev/null differ diff --git a/Internet/images/google-spreadsheet1.png b/Internet/images/google-spreadsheet1.png deleted file mode 100644 index b146d6c..0000000 Binary files a/Internet/images/google-spreadsheet1.png and /dev/null differ diff --git a/Internet/images/publishing-google1.png b/Internet/images/publishing-google1.png deleted file mode 100644 index 2bcd818..0000000 Binary files a/Internet/images/publishing-google1.png and /dev/null differ diff --git a/Internet/images/zillow1.png b/Internet/images/zillow1.png deleted file mode 100644 index 2afebcd..0000000 Binary files a/Internet/images/zillow1.png and /dev/null differ diff --git a/Interweb/Internet.Rnw b/Interweb/Internet.Rnw deleted file mode 100644 index 195a0f6..0000000 --- 
a/Interweb/Internet.Rnw +++ /dev/null @@ -1,1710 +0,0 @@ -<>= -source('../include/setup.R') -opts_chunk$set( fig.path="figure/Internet-fig-" ) -if (!exists("standAlone")) set_parent('../include/MainDocument.Rnw') -set.seed(123) -@ - - -<>= -require(fastR) -@ - -\chapter{Introduction} - - -XX taken straight from Core: needs complete rewrite and reorganization. - -Possible ideas to include -\begin{enumerate} -\item mapping - \begin{enumerate} - \item choropleth - \item access to gapminder - \item Googlemaps - \item what proportion of Michigan is a lake - \end{enumerate} -\item data scraping and remote access - \begin{enumerate} - \item API's - \item twitteR - \item American Community Survey - \item Databases and SQL - \end{enumerate} -\item Other things from CVC or other places? -\end{enumerate} - -In this monograph, we briefly review the commands and functions needed -to analyze data from introductory and second courses in statistics. This is intended to complement -the \emph{Start Teaching with R} and \emph{Start Modeling with R} books. - -Most of our examples will use data from the HELP (Health Evaluation and Linkage to Primary -Care) study: a randomized trial of a novel -way to link at-risk subjects with primary care. More information on the -dataset can be found in chapter \ref{sec:help}. - - -Since the selection and order of topics can vary greatly from -textbook to textbook and instructor to instructor, we have chosen to -organize this material by the kind of data being analyzed. This should make -it straightforward to find what you are looking for even if you present -things in a different order. This is also a good organizational template -to give your students to help them keep straight ``what to do when". - -Some data management is needed by students (and more by instructors). This -material is reviewed in chapter \ref{sec:manipulatingData}. 
-
-
-This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the
-\pkg{mosaic} package, which was written to simplify the use of R for introductory statistics courses. A short summary of the R needed to teach introductory statistics can be found in the \pkg{mosaic} package vignette (\url{http://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf}).
-
-Other related resources from Project MOSAIC may be helpful, including an annotated set of examples
-from the
-sixth edition of
-Moore, McCabe and Craig's \emph{Introduction to the Practice of Statistics}\cite{moor:mcca:2007} (see \url{http://www.amherst.edu/~nhorton/ips6e}) as well as
-the second edition of the \emph{Statistical Sleuth}\cite{Sleuth2} (see \url{http://www.amherst.edu/~nhorton/sleuth}).
-
-To use a package within R, it must be installed (one time) and loaded (each session). The
-\pkg{mosaic} package can be installed using the following command:
-<>=
-install.packages('mosaic')  # note the quotation marks
-@
-The {\tt \#} character is a comment in R, and all text after that on the
-current line is ignored.
-
-Once the package is installed (one time only), it can be loaded by running the command:
-<>=
-require(mosaic)
-@
-
-
-\chapter{One Quantitative Variable}
-
-\section{Numerical summaries}
-
-\R\ includes a number of commands to numerically summarize variables.
-These include the capability of calculating the mean, standard deviation,
-variance, median, five-number summary, and interquartile range (IQR), as well as arbitrary quantiles. We will
-illustrate these using the CESD (Center for Epidemiologic Studies--Depression)
-measure of depressive symptoms (which takes on values between 0 and 60, with higher
-scores indicating more depressive symptoms).
-
-To improve the legibility of output,
-we will also set the default number of digits to display to a more reasonable
-level (see \function{?options} for more configuration possibilities).
-
-<>=
-require(mosaic)
-options(digits=3)
-@
-Note that the \function{mean()} function in the \pkg{mosaic} package supports a modeling language
-common to \pkg{lattice} graphics and linear models (e.g. \function{lm()}). We will use
-this modeling language throughout this document.
-\DiggingDeeper{The \emph{Start Modeling with R} book will be helpful if you are unfamiliar with the
-modeling language.}
-<<>>=
-mean(~ cesd, data=HELPrct)
-@
-
-The same output could be
-created using the following commands (though we will use the \pkg{mosaic} versions when available).
-<<>>=
-with(HELPrct, mean(cesd))
-mean(HELPrct$cesd)
-@
-Similar functionality exists for other summary statistics.
-<>=
-sd(~ cesd, data=HELPrct)
-@
-<>=
-sd(~ cesd, data=HELPrct)^2
-var(~ cesd, data=HELPrct)
-@
-
-It is also straightforward to calculate quantiles of the distribution.
-
-<>=
-median(~ cesd, data=HELPrct)
-@
-
-By default, the
-\function{quantile()} function displays the quartiles, but can be given
-a vector of quantiles to display.
-\Caution{Not all commands (including \function{quantile()}) have been upgraded to
-support the formula interface. These must be accessed using \function{with()} or the \$ operator.}
-<>=
-with(HELPrct, quantile(cesd))
-with(HELPrct, quantile(cesd, c(.025, .975)))
-@
-
-Finally, the \function{favstats()}
-function in the \pkg{mosaic} package provides a concise summary of
-many useful statistics.
-<<>>=
-favstats(~ cesd, data=HELPrct)
-@
-
-\section{Graphical summaries}
-The \function{histogram()} function is used to create a histogram.
-\FoodForThought{\code{x} is for eXtra.}%
-Here we use the formula interface (as discussed in the \emph{Start Modeling with R} book) to
-specify that we want a histogram of the CESD scores.
- -\vspace{-4mm} -\begin{center} -<>= -histogram(~ cesd, data=HELPrct) -@ -\end{center} - - -In the \variable{HELPrct} dataset, approximately one quarter of the subjects are female. -<<>>= -tally(~ sex, data=HELPrct) -tally(~ sex, format="percent", data=HELPrct) -@ -It is straightforward to restrict our attention to just those subjects. -If we are going to do many things with a subset of our data, it may be easiest -to make a new data frame containing only the cases we are interested in. -The \function{subset()} function can generate a new data frame containing -just the women or just the men (see also section \ref{sec:subsets}). Once this is created, we -used the \function{stem()} function to create a stem and leaf plot. -\Caution{Note that the equality operator is \emph{two} equal signs} -<>= -female <- subset(HELPrct, sex=='female') -male <- subset(HELPrct, sex=='male') -with(female, stem(cesd)) -@ - -Subsets can also be generated and used on the fly (this time including -an overlaid normal density): -<>= -histogram(~ cesd, fit="normal", - data=subset(HELPrct, sex=='female')) -@ - -Alternatively, we can make side-by-side plots to compare multiple subsets. -<>= -histogram(~ cesd | sex, data=HELPrct) -@ - -The layout can be rearranged. -\begin{center} -<>= -histogram(~ cesd | sex, layout=c(1, 2), data=HELPrct) -@ -\end{center} -\begin{problem} -Using the \dataframe{HELPrct} dataset, -create side-by-side boxplots of the CESD scores by substance abuse -group, just for the male subjects, with an overlaid normal density. -\end{problem}% -\begin{solution} -<>= -bwplot(cesd ~ substance, fit="normal", - data=subset(HELPrct, sex=='male')) -@ -\end{solution}% -We can control the number of bins in a number of ways. These can be specified -as the total number. -\begin{center} -<>= -histogram(~ cesd, nint=20, data=female) -@ -\end{center} -Or the width can be specified. 
-\begin{center} -<>= -histogram(~ cesd, width=1, data=female) -@ -\end{center} -We could also have made our subset ``on the fly'', just for the purposes of graphing: -\begin{center} -<>= -histogram(~ cesd, data=HELPrct, subset=(sex=='female')) -@ -\end{center} - -The \function{dotPlot()} function is used to create a dotplot (a la Fathom) -for a smaller subset of subjects (homeless females). We also demonstrate -how to change the x-axis label. -<>= -dotPlot(~ cesd, xlab="CESD score", - data=subset(HELPrct, sex=="female" & homeless=="homeless")) -@ - - -\section{Density curves} - -One disadvantage of histograms is that they can be sensitive to the choice of the -number of bins. Another display to consider is a density curve. -\FoodForThought{Density plots are also sensitive to certain choices. If your density plot -is too jagged or too smooth, try adjusting the \option{adjust} argument (larger than 1 for -smoother plots, less than 1 for more jagged plots).} - -Here we adorn a density plot with some gratuitous additions to -demonstrate how to build up a graphic for pedagogical purposes. -We add some text, a superimposed normal density as well as -a vertical line. A variety of line types can be specified, -as well as line widths. - -\begin{center} -<>= -densityplot(~ cesd, data=female) -ladd(grid.text(x=0.2, y=0.8, 'only females')) -with(female, ladd(panel.mathdensity(args= - list(mean=mean(cesd), sd=sd(cesd)), col="red"))) -ladd(panel.abline(v=60, lty=2, lwd=2, col="grey")) -@ -\end{center} -\DiggingDeeper{The \function{plotFun()} function can also be used to annotate plots (see -section \ref{sec:plotFun}).} - - -\section{Normal distributions} - -The most famous density curve is a normal distribution. The \function{xpnorm()} function -displays the probability that a random variable is less than the first argument, for a -normal distribution with mean given by the second argument and standard deviation by the -third. 
More information about probability distributions can
-be found in section \ref{sec:probability}.
-\FoodForThought{\code{x} is for eXtra.}
-\begin{center}
-<<>>=
-xpnorm(1.96, mean=0, sd=1)
-@
-\end{center}
-
-\section{Inference for a single sample}
-\label{sec:bootstrapsing}
-
-We can calculate a 95\% confidence interval for the mean CESD
-score for females by using a t-test:
-<<>>=
-t.test(~ cesd, data=female)
-confint(t.test(~ cesd, data=female))
-@
-
-But it's also straightforward to calculate this using a bootstrap
-(using the approach described in the \pkg{mosaic} package Resampling Vignette).
-The statistic that we want to resample is the mean.
-<<>>=
-mean(~ cesd, data=female)
-@
-
-One resampling trial can be carried out:
-<<>>=
-mean(~ cesd, data=resample(female))
-@
-Another will yield different results:
-<<>>=
-mean(~ cesd, data=resample(female))
-@
-\TeachingTip{Even though a single trial is of little use, it's smart to have
-students do the calculation to show that they are (usually!) getting a different
-result than without resampling.}
-
-Now conduct 1000 resampling trials, saving the results in an object
-called \texttt{trials}:
-<<>>=
-trials = do(1000) * mean(~ cesd, data=resample(female))
-with(trials, quantile(result, c(.025, .975)))
-@
-
-\chapter{One Categorical Variable}
-
-\section{Numerical summaries}
-
-The \function{tally()} function can be used to calculate
-counts, percentages and proportions for a categorical variable.
-
-<<>>=
-tally(~ homeless, data=HELPrct)
-tally(~ homeless, margins=FALSE, data=HELPrct)
-tally(~ homeless, format="percent", data=HELPrct)
-tally(~ homeless, format="proportion", data=HELPrct)
-@
-
-\section{The binomial test}
-
-An exact confidence interval for a proportion (as well as a test of the null
-hypothesis that the population proportion is equal to a particular value [by default 0.5]) can be calculated
-using the \function{binom.test()} function.
-The standard \function{binom.test()} requires us to tabulate.
-<<>>=
-binom.test(209, 209 + 244)
-@
-The \pkg{mosaic} package provides a formula interface that avoids the need to pre-tally
-the data.
-<<>>=
-result <- binom.test(~ homeless=="homeless", data=HELPrct)
-result
-@
-
-As is generally the case with commands of this sort,
-there are a number of useful quantities available from
-the object returned by the function.
-<<>>=
-names(result)
-@
-These can be extracted using the {\tt \$} operator or an extractor function.
-For example, the user can extract the confidence interval or p-value.
-<<>>=
-result$statistic
-confint(result)
-pval(result)
-@
-\DiggingDeeper{Most of the objects in \R\ have a \function{print()}
-method. So when we get \code{result}, what we are seeing displayed in the console is
-\code{print(result)}. There may be a good deal of additional information
-lurking inside the object itself. To make matters even more complicated, some
-objects are returned \emph{invisibly}, so nothing prints. You can still assign
-the returned object to a variable and process it later, even if nothing shows up
-on the screen. This is the case for the \pkg{lattice} graphics functions, for example.
-You can save a plot into a variable, say \code{myplot}, and display the plot again later
-using \code{print(myplot)}.}%
-
-
-\section{The proportion test}
-
-A similar interval and test can be calculated using \function{prop.test()}.
-<<>>=
-tally(~ homeless, data=HELPrct)
-prop.test(~ homeless=="homeless", correct=FALSE, data=HELPrct)
-@
-It also accepts summarized data, the way \function{binom.test()} does.
-\InstructorNote{\function{prop.test()} calculates a Chi-squared statistic.
-Most introductory texts use a $z$-statistic. They are mathematically equivalent
-in terms of inferential statements, but
-you may need to address the discrepancy with your students.}%
-<<>>=
-prop.test(209, 209 + 244, correct=FALSE)
-@
-To make things simpler still, we've added a formula interface in the \pkg{mosaic} package.
-<<>>=
-prop.test(~ homeless, data=HELPrct)
-@
-
-\section{Goodness of fit tests}
-
-A variety of goodness of fit tests can be calculated against a reference
-distribution. For the HELP data, we could test the null hypothesis that there is an equal
-proportion of subjects in each substance abuse group in the original population.
-
-\Caution{The \option{margins=FALSE} option is needed here to include only the counts.}
-<<>>=
-tally(~ substance, format="percent", data=HELPrct)
-observed <- tally(~ substance, margins=FALSE, data=HELPrct)
-observed
-@
-<<>>=
-p <- c(1/3, 1/3, 1/3)   # equivalent to rep(1/3, 3)
-chisq.test(observed, p=p)
-total <- sum(observed); total
-expected <- total*p; expected
-@
-
-We can also calculate this quantity manually, in terms of observed and expected values.
-
-\TeachingTip{We don't encourage much manual calculation in our courses.}
-<<>>=
-chisq <- sum((observed - expected)^2/(expected)); chisq
-1 - pchisq(chisq, df=2)
-@
-
-Alternatively, the \pkg{mosaic} package provides a version of \function{chisq.test()} with
-more verbose output.
-\FoodForThought{\code{x} is for eXtra.}
-<<>>=
-xchisq.test(observed, p=p)
-# clean up variables no longer needed
-rm(observed, p, total, chisq)
-@
-
-
-\chapter{Two Quantitative Variables}
-
-\section{Scatterplots}
-
-We always encourage students to start any analysis by graphing their data.
-Here we augment a scatterplot
-of the CESD (a measure of depressive symptoms, where higher scores indicate more symptoms) and the MCS (mental component score from the SF-36, where higher scores indicate better functioning)
-with a lowess (locally weighted scatterplot smoother) line, using a circle
-as the plotting character and a slightly thicker line.
-
-\InstructorNote{The lowess line can help to assess linearity of a relationship.}
-\begin{center}
-<<>>=
-xyplot(cesd ~ mcs, type=c('p','smooth'), pch=1, cex=0.6,
-  lwd=3, data=HELPrct)
-@
-\end{center}
-
-
-\section{Correlation}
-
-Correlations can be calculated for a pair of variables, or for a matrix of variables.
-<<>>=
-cor(cesd, mcs, data=HELPrct)
-smallHELP = subset(HELPrct, select=c(cesd, mcs, pcs))
-cor(smallHELP)
-@
-
-By default, Pearson correlations are provided; other variants (e.g., Spearman) can be specified using the
-\option{method} option.
-<<>>=
-cor(cesd, mcs, method="spearman", data=HELPrct)
-@
-
-\section{Simple linear regression}
-
-\InstructorNote{We tend to introduce linear regression
-early in our courses, as a purely descriptive technique.}
-
-Linear regression models are described in detail in \emph{Start Modeling with R}.
-These use the same formula interface introduced previously for numerical and graphical
-summaries
-to specify the outcome
-and predictors. Here we consider fitting the model \model{\variable{cesd}}{\variable{mcs}}.
-
-
-<<>>=
-model <- lm(cesd ~ mcs, data=HELPrct)
-coef(model)
-@
-To simplify the output, we turn off the option to display significance stars.
-<<>>=
-options(show.signif.stars=FALSE)
-coef(model)
-summary(model)
-confint(model)
-rsquared(model)
-@
-
-
-<<>>=
-class(model)
-@
-The return value from \function{lm()} is a linear model object.
-A number of functions can operate on these objects, as
-seen previously with \function{coef()}.
-The function \function{residuals()} returns a
-vector of the residuals.
-\FoodForThought{The function \function{residuals()} can be abbreviated
-\function{resid()}.
Another useful function is \function{fitted()}, which
-returns a vector of predicted values.}
-
-\begin{center}
-<<>>=
-histogram(~ residuals(model), density=TRUE)
-@
-\end{center}
-\begin{center}
-<<>>=
-qqmath(~ resid(model))
-@
-\end{center}
-\begin{center}
-<<>>=
-xyplot(resid(model) ~ fitted(model), type=c("p", "smooth", "r"),
-  alpha=0.5, cex=0.3, pch=20)
-@
-\end{center}
-
-Prediction bands can be added to a plot using the \function{panel.lmbands()} function.
-\begin{center}
-<<>>=
-xyplot(cesd ~ mcs, panel=panel.lmbands, cex=0.2,
-  band.lwd=2, data=HELPrct)
-@
-\end{center}
-\begin{problem}
-Using the \dataframe{HELPrct} dataset, fit a simple linear regression model
-predicting the number of drinks per day as a function of the mental
-component score.
-This model can be specified using the formula:
-\model{\variable{i1}}{\variable{mcs}}.
-Assess the distribution of the residuals for this model.
-\end{problem}
-
-
-\chapter{Two Categorical Variables}
-
-
-\section{Cross classification tables}
-\label{sec:cross}
-
-Cross classification (two-way or $R$ by $C$) tables can be constructed for
-two (or more) categorical variables. Here we consider the contingency table
-for homeless status (homeless one or more nights in the past 6 months, or housed)
-and sex.
-
-<<>>=
-tally(~ homeless + sex, margins=FALSE, data=HELPrct)
-@
-
-We can calculate the odds ratio directly from the table:
-<<>>=
-OR = (40*177)/(67*169); OR
-@
-
-The
-\pkg{mosaic} package has a function which will calculate odds ratios:
-<<>>=
-oddsRatio(tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct))
-@
-Note that the reference group is flipped (and that $1/1.599 = \Sexpr{1/1.599}$).
-
-Graphical summaries of cross classification tables may be helpful in visualizing
-associations. Mosaic plots are one example (though the jury is still out
-regarding their utility, relative to the low data-to-ink ratio\cite{Tufte:2001:Visual}).
-Here we see that males tend to be over-represented
-amongst the homeless subjects (as represented by the horizontal line, which is higher for
-the homeless than for the housed).
-\FoodForThought{The \function{mosaic()} function
-in the \pkg{vcd} package also makes mosaic plots.}
-\begin{center}
-<<>>=
-mytab <- tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct)
-mosaicplot(mytab)
-@
-\end{center}
-
-\section{Chi-squared tests}
-
-<<>>=
-chisq.test(tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct), correct=FALSE)
-@
-
-There is a statistically significant association found: it is unlikely that we would observe
-an association this strong if homeless status and sex were independent in the
-population.
-
-When a student finds a significant association,
-it's important for them to be able to interpret this in the context of the problem.
-The \function{xchisq.test()} function provides additional details to help with this process.
-\FoodForThought{\code{x} is for eXtra.}
-<<>>=
-xchisq.test(tally(~homeless + sex, margins=FALSE,
-  data=HELPrct), correct=FALSE)
-@
-
-We observe that there are fewer homeless women, and more homeless men, than would be expected.
-
-\section{Fisher's exact test}
-
-An exact test can also be calculated. This is fairly computationally straightforward for 2 by 2
-tables. Options to help constrain the size of the problem for larger tables exist
-(see \verb!?fisher.test()!).
-
-\DiggingDeeper{Note the different estimate of the odds ratio from that seen in section \ref{sec:cross}.
-The \function{fisher.test()} function uses a different estimator (and a different interval based
-on the profile likelihood).}
-%with(HELPrct, fisher.test(homeless, sex))
-<<>>=
-fisher.test(tally(~homeless + sex, margins=FALSE,
-  data=HELPrct))
-@
-
-\chapter{Quantitative Response to a Categorical Predictor}
-
-\section{A dichotomous predictor: numerical and graphical summaries}
-Here we will compare the distributions of CESD scores by sex.
-
-The \function{mean()} function can be used to calculate the mean CESD score
-separately for males and females.
-<<>>=
-mean(cesd ~ sex, data=HELPrct)
-@
-
-The \function{favstats()} function can provide more statistics by group.
-<<>>=
-favstats(cesd ~ sex, data=HELPrct)
-@
-
-
-Boxplots are a particularly helpful graphical display to compare distributions.
-The \function{bwplot()} function can be used to display the boxplots for the
-CESD scores separately by sex. We see from both the numerical and graphical
-summaries that women tend to have slightly higher CESD scores than men.
-
-\FoodForThought{Although we usually put explanatory variables along the horizontal axis,
-page layout sometimes makes the other orientation preferable for these plots.}
-%\vspace{-8mm}
-\begin{center}
-<<>>=
-bwplot(sex ~ cesd, data=HELPrct)
-@
-\end{center}
-
-When sample sizes are small, there is no reason to summarize with a boxplot,
-since \function{xyplot()} can handle categorical predictors.
-Even with 10--20 observations in a group, a scatter plot is often quite readable.
-Setting the alpha level helps detect multiple observations with the same value.
-\FoodForThought{One of us once saw a biologist proudly present
-side-by-side boxplots. Thinking a major victory had been won, he naively
-asked how many observations were in each group. ``Four,'' replied the
-biologist.}
-\begin{center}
-<<>>=
-xyplot(sex ~ length, data=KidsFeet, alpha=.6, cex=1.4)
-@
-\end{center}
-
-\section{A dichotomous predictor: two-sample t}
-
-Student's two-sample t-test can be run with or without an equal variance assumption.
-<<>>=
-t.test(cesd ~ sex, var.equal=FALSE, data=HELPrct)
-@
-We see that there is a statistically significant difference between the two groups.
-
-The groups can also be compared using the \function{lm()} function (with an equal variance assumption).
-<<>>= -summary(lm(cesd ~ sex, data=HELPrct)) -@ - - -\section{Non-parametric 2 group tests} - -The same conclusion is reached using a non-parametric (Wilcoxon rank sum) test. - -<>= -wilcox.test(cesd ~ sex, data=HELPrct) -@ - - -\section{Permutation test} - -Here we extend the methods introduced in section \ref{sec:bootstrapsing} to -undertake a two-sided test comparing the ages at baseline by gender. First we calculate the observed difference in means: -<<>>= -mean(age ~ sex, data=HELPrct) -test.stat <- compareMean(age ~ sex, data=HELPrct) -test.stat -@ -We can calculate the same statistic after shuffling the group labels: -<<>>= -do(1) * compareMean(age ~ shuffle(sex), data=HELPrct) -do(1) * compareMean(age ~ shuffle(sex), data=HELPrct) -do(3) * compareMean(age ~ shuffle(sex), data=HELPrct) -@ - -<>= -rtest.stats = do(500) * compareMean(age ~ shuffle(sex), - data=HELPrct) -histogram(~ result, n=40, xlim=c(-6, 6), - groups=result >= test.stat, pch=16, cex=.8, - data=rtest.stats) -ladd(panel.abline(v=test.stat)) -@ - -Here we don't see much evidence to contradict the null hypothesis that men and -women -have the same mean age in the population. - -\section{One-way ANOVA} - -Earlier comparisons were between two groups: we can also consider testing differences between -three or more groups using one-way ANOVA. Here we compare -CESD scores by primary substance of abuse (heroin, cocaine, or alcohol). - -\begin{center} -<>= -bwplot(cesd ~ substance, data=HELPrct) -@ -\end{center} - - -<>= -mean(cesd ~ substance, data=HELPrct) -@ -<>= -mod <- aov(cesd ~ substance, data=HELPrct) -summary(mod) -@ -While still high (scores of 16 or more are generally considered to be -``severe'' symptoms), the cocaine-involved group tend to have lower -scores than those whose primary substances are alcohol or heroin. -<>= -mod1 <- lm(cesd ~ 1, data=HELPrct) -mod2 <- lm(cesd ~ substance, data=HELPrct) -@ -The \function{anova()} command can summarize models. 
-<<>>= -anova(mod2) -@ -The \function{anova()} command can also be used to formally -compare two (nested) models. -<>= -anova(mod1, mod2) -@ - - -\section{Tukey's Honest Significant Differences} - -There are a variety of multiple comparison procedures that can be -used after fitting an ANOVA model. One of these is Tukey's Honest -Significant Difference (HSD). Other options are available within the -\pkg{multcomp} package. - -<>= -favstats(cesd ~ substance, data=HELPrct) -@ -<>= -HELPrct <- transform(HELPrct, subgrp = factor(substance, - levels=c("alcohol", "cocaine", "heroin"), - labels=c("A", "C", "H"))) -mod <- lm(cesd ~ subgrp, data=HELPrct) -compare <- TukeyHSD(mod, "subgrp") -compare -@ -<>= -plot(compare,cex.lab=0.5) -@ - -Again, we see that the cocaine group has significantly lower CESD scores -than the other two groups. - -\chapter{Categorical Response to a Quantitative Predictor} - -\section{Logistic regression} - -Logistic regression is available using the \function{glm()} function, -which supports -a variety of -link functions and distributional forms for generalized linear models, including logistic regression. -\FoodForThought{The \function{glm()} function has arguments \option{family}, which can take an option -\option{link}. The \code{logit} link is the default link for the binomial family, -so we don't need to specify it here.} -<>= -logitmod <- glm(homeless ~ age + female, family=binomial, - data=HELPrct) -summary(logitmod) -exp(coef(logitmod)) -exp(confint(logitmod)) -@ - -\authNote{Add a plot with logistic fit overlaid?} - -\chapter{Survival Time Outcomes} - -Extensive support for survival (time to event) analysis is available within the -\pkg{survival} package. 
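-The basic building block for these analyses is a survival object, which pairs each
-follow-up time with an event indicator. As a quick illustration (a sketch we add here;
-it is not part of the original example, and uses the \variable{dayslink} and
-\variable{linkstatus} variables from \dataframe{HELPrct}):
-<<>>=
-require(survival)
-# each entry is a follow-up time; a "+" suffix marks censored observations
-head(with(HELPrct, Surv(dayslink, linkstatus)))
-@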
-
-\section{Kaplan-Meier plot}
-
-\begin{center}
-<<>>=
-require(survival)
-fit <- survfit(Surv(dayslink, linkstatus) ~ treat,
-  data=HELPrct)
-plot(fit, conf.int=FALSE, lty=1:2, lwd=2,
-  xlab="time (in days)", ylab="P(not linked)")
-legend(20, 0.4, legend=c("Control", "Treatment"),
-  lty=c(1,2), lwd=2)
-title("Product-Limit Survival Estimates (time to linkage)")
-@
-\end{center}
-
-We see that the subjects in the treatment group (Health Evaluation and Linkage to Primary Care clinic) were significantly more likely to
-link to primary care (less likely to ``survive'') than the control (usual care) group.
-
-\section{Cox proportional hazards model}
-
-<<>>=
-require(survival)
-summary(coxph(Surv(dayslink, linkstatus) ~ age + substance,
-  data=HELPrct))
-@
-
-Neither age nor substance group was significantly associated with linkage to primary care.
-
-
-\chapter{More than Two Variables}
-
-\section{Two (or more) way ANOVA}
-
-We can fit a two (or more) way ANOVA model, with or without an interaction,
-using the same modeling syntax.
-<<>>=
-median(cesd ~ substance | sex, data=HELPrct)
-bwplot(cesd ~ subgrp | sex, data=HELPrct)
-@
-<<>>=
-summary(aov(cesd ~ substance + sex, data=HELPrct))
-@
-<<>>=
-summary(aov(cesd ~ substance * sex, data=HELPrct))
-@
-There's little evidence for the interaction, though there are statistically
-significant main effects terms for \variable{substance} group and
-\variable{sex}.
-
-<<>>=
-xyplot(cesd ~ substance, groups=sex, type='a',
-  data=HELPrct)
-@
-
-
-\section{Multiple regression}
-
-Multiple regression is a logical extension of the prior commands, adding
-additional predictors (and allowing students to start to try to disentangle
-multivariate relationships).
-
-\InstructorNote{We also tend to introduce multiple linear regression
-early in our courses, as a purely descriptive technique, then return to it
-regularly.
The motivation for this is described at length in the companion volume
-\emph{Start Modeling with R}.}
-
-Here we consider a model (parallel slopes) for depressive symptoms as a function of Mental Component Score (MCS),
-age (in years), and sex of the subject.
-<<>>=
-lm1 <- lm(cesd ~ mcs + age + sex, data=HELPrct)
-summary(lm1)
-@
-We can also fit a model that includes an interaction between MCS and sex.
-<<>>=
-lm2 <- lm(cesd ~ mcs + age + sex + mcs:sex, data=HELPrct)
-summary(lm2)
-anova(lm2)
-@
-<<>>=
-anova(lm1, lm2)
-@
-There is little evidence for an interaction effect, so we drop
-this from the model.
-
-\subsection{Visualizing the results from the regression}
-
-\label{sec:plotFun}
-The \function{makeFun()} and \function{plotFun()} functions from the \pkg{mosaic} package
-can be used to display the results from a regression model. For this example, we might
-display the predicted CESD values for a range of MCS values for 36 year old male and female subjects from the parallel
-slopes model.
-<<>>=
-lm1fun = makeFun(lm1)
-@
-We can now plot this function for a range of values for MCS (mental component score), along
-with the observed data for 36 year olds.
-<<>>=
-xyplot(cesd ~ mcs, groups=sex, auto.key=TRUE,
-  data=subset(HELPrct, age==36))
-plotFun(lm1fun(mcs, age=36, sex="male") ~ mcs,
-  xlim=c(0,60), lwd=2, ylab="predicted CESD", add=TRUE)
-plotFun(lm1fun(mcs, age=36, sex="female") ~ mcs,
-  xlim=c(0,60), lty=2, lwd=3, add=TRUE)
-@
-
-
-
-\subsection{Residual diagnostics}
-
-It's straightforward to undertake residual diagnostics for this model. We begin by adding the
-fitted values and residuals to the dataset.
-\Caution{Be careful when fitting regression models with missing values (see also section \ref{sec:miss}).}
-<<>>=
-HELPrct = transform(HELPrct, residuals = resid(lm1))
-HELPrct = transform(HELPrct, pred = fitted(lm1))
-@
-<<>>=
-histogram(~ residuals, xlab="residuals", fit="normal",
-  data=HELPrct)
-@
-We can display observations with extremely large residuals.
-<<>>=
-subset(HELPrct, abs(residuals) > 25)
-@
-
-<<>>=
-xyplot(residuals ~ pred, ylab="residuals", cex=0.3,
-  xlab="predicted values", main="predicted vs. residuals",
-  type=c("p", "r", "smooth"), data=HELPrct)
-@
-<<>>=
-xyplot(residuals ~ mcs, xlab="mental component score",
-  ylab="residuals", cex=0.3,
-  type=c("p", "r", "smooth"), data=HELPrct)
-@
-The assumptions of normality, linearity and homoscedasticity seem reasonable here.
-\begin{problem}
-The \dataframe{RailTrail} dataset within the \pkg{mosaic} package includes the counts
-of crossings of a rail trail in Northampton, Massachusetts for 90 days in 2005.
-City officials are interested in understanding usage of the trail network, and
-how it changes as a function of temperature and day of the week.
-Describe the distribution of the variable \variable{avgtemp} in terms of its
-center, spread and shape.
-<<>>=
-favstats(~ avgtemp, data=RailTrail)
-densityplot(~ avgtemp, xlab="Average daily temp (degrees F)",
-  data=RailTrail)
-@
-\end{problem}
-\begin{solution}
-The distribution of average temperature (in degrees Fahrenheit) is approximately normally
-distributed with mean 57.4 degrees and standard deviation of 11.3 degrees.
-\end{solution}
-\begin{problem}
-The \dataframe{RailTrail} dataset also includes a variable called \variable{cloudcover}.
-Describe the distribution of this variable in terms of its
-center, spread and shape.
-\end{problem}
-\begin{solution}
-<<>>=
-favstats(~ cloudcover, data=RailTrail)
-densityplot(~ cloudcover, data=RailTrail)
-@
-The distribution of cloud cover is ungainly (almost triangular), with increasing probability for more
-cloud cover. The mean is 5.8 oktas, with a standard deviation of 3.2 oktas. It tends to be
-cloudy in Northampton!
-\end{solution}
-\begin{problem}
-The variable in the \dataframe{RailTrail} dataset that provides the daily count
-of crossings is called \variable{volume}.
-Describe the distribution of this variable in terms of its
-center, spread and shape.
-\end{problem}
-\begin{solution}
-<<>>=
-favstats(~ volume, data=RailTrail)
-densityplot(~ volume, xlab="# of crossings", data=RailTrail)
-subset(RailTrail, volume > 700)
-@
-The distribution of daily crossings is approximately normally
-distributed with mean 375 crossings and standard deviation of 127 crossings.
-There is one outlier with 736 crossings, which occurred on a Monday holiday in the spring
-(Memorial Day).
-\end{solution}
-\begin{problem}
-The \dataframe{RailTrail} dataset also contains an indicator of whether the day was
-a weekday (\variable{weekday==1}) or a weekend/holiday (\variable{weekday==0}).
-Use \function{tally()} to describe the distribution of this categorical variable.
-What percentage of the days are weekends/holidays?
-\end{problem}
-\begin{solution}
-<<>>=
-tally(~ weekday, data=RailTrail)
-tally(~ weekday, format="percent", data=RailTrail)
-@
-Just over 30\% of the days are weekends or holidays.
-\end{solution}
-\begin{problem}
-Use side-by-side boxplots to compare the distribution of \variable{volume} by day type in the \dataframe{RailTrail} dataset.
-Hint: you'll need to turn the numeric \variable{weekday} variable into a factor variable using \function{as.factor()}.
-What do you conclude?
-\end{problem}
-\begin{solution}
-<<>>=
-bwplot(volume ~ as.factor(weekday), data=RailTrail)
-@
-or
-<<>>=
-RailTrail = transform(RailTrail, daytype = ifelse(weekday==1, "weekday", "weekend/holiday"))
-bwplot(volume ~ daytype, data=RailTrail)
-@
-We see that the weekends/holidays tend to have more users.
-\end{solution}
-
-\begin{problem}
-Use overlapping densityplots to compare the distribution of \variable{volume} by day type in the
-\dataframe{RailTrail} dataset.
-What do you conclude?
-\end{problem}
-\begin{solution}
-<<>>=
-densityplot(~ volume, groups=weekday, auto.key=TRUE, data=RailTrail)
-@
-We see that the weekends/holidays tend to have more users.
-\end{solution} -\begin{problem} -Create a scatterplot of \variable{volume} as a function of \variable{avgtemp} using the \dataframe{RailTrail} dataset, along with a regression line and scatterplot -smoother (lowess curve). What do you observe about the relationship? -\end{problem} -\begin{solution} -<<>>= -xyplot(volume ~ avgtemp, xlab="average temperature (degrees F)", - type=c("p", "r", "smooth"), lwd=2, data=RailTrail) -@ -We see that there is a positive relationship between these two variables, but the association is -somewhat nonlinear (which makes sense as we wouldn't continue to predict an increase in usage when the -temperature becomes uncomfortably warm). -\end{solution} -\begin{problem} -Using the \dataframe{RailTrail} dataset, -fit a multiple regression model for \variable{volume} as a function of \variable{cloudcover}, \variable{avgtemp}, -\variable{weekday} and the interaction -between day type and average temperature. -Is there evidence to retain the interaction term at the $\alpha=0.05$ level? -\end{problem} -\begin{solution} -<<>>= -fm = lm(volume ~ cloudcover + avgtemp + weekday + avgtemp:weekday, data=RailTrail) -summary(fm) -@ -The interaction between average temperature and day-type is statistically significant (p=0.016). We -interpret this as being a steeper slope (stronger association) on weekdays rather than weekends. -(Perhaps on weekends/holidays people will tend to head out on the trails irrespective of the weather?) -\end{solution} -\begin{problem} -Use \function{makeFun()} to calculate the predicted number of crossings on a weekday with average -temperature 60 degrees and no clouds. Verify this calculation using the coefficients from the -model. -<<>>= -coef(fm) -@ -\end{problem} -\begin{solution} -<<>>= -myfun = makeFun(fm) -myfun(cloudcover=0, avgtemp=60, weekday=1) -@ -We expect just over 480 crossings on a day with these characteristics. 
-\end{solution}
-\begin{problem}
-Use \function{makeFun()} and \function{plotFun()} to display predicted values for the number of crossings
-on weekdays and weekends/holidays for average temperatures between 30 and 80 degrees and a cloudy day
-(\variable{cloudcover=10}).
-\end{problem}
-\begin{solution}
-<<>>=
-myfun = makeFun(fm)
-xyplot(volume ~ avgtemp, data=RailTrail)
-plotFun(myfun(cloudcover=10, avgtemp, weekday=0) ~ avgtemp, lwd=2, add=TRUE)
-plotFun(myfun(cloudcover=10, avgtemp, weekday=1) ~ avgtemp, lty=2, lwd=3, add=TRUE)
-@
-Again we see the steeper slope (stronger association) on weekdays than on
-weekends/holidays.
-\end{solution}
-\begin{problem}
-Using the multiple regression model, generate a histogram (with overlaid normal
-density) to assess the normality of the residuals.
-\end{problem}
-\begin{solution}
-<<>>=
-histogram(~ resid(fm), fit="normal")
-@
-The distribution is approximately normal.
-\end{solution}
-\begin{problem}
-Using the same model, generate a scatterplot of the residuals versus predicted values and comment
-on the linearity of the model and the assumption of equal variance.
-\end{problem}
-\begin{solution}
-<<>>=
-xyplot(resid(fm) ~ fitted(fm), type=c("p", "r", "smooth"))
-@
-The association is fairly linear, except in the tails. There's some evidence that the variability
-of the residuals increases with larger fitted values.
-\end{solution}
-\begin{problem}
-Using the same model, generate a scatterplot of the residuals versus average temperature and comment
-on the linearity of the model and the assumption of equal variance.
-\end{problem}
-\begin{solution}
-<<>>=
-xyplot(resid(fm) ~ avgtemp, type=c("p", "r", "smooth"), data=RailTrail)
-@
-The association is somewhat non-linear. There's some evidence that the variability
-of the residuals increases with larger fitted values.
-\end{solution}
-
-\chapter{Probability Distributions and Random Variables}
-
-\label{sec:DiscreteDistributions}
-\label{sec:probability}
-
-\R\ can calculate quantities related to probability distributions of all types.
-It is straightforward to generate
-random variables from these distributions, which can be used
-for simulation and analysis.
-<<>>=
-xpnorm(1.96, mean=0, sd=1)   # P(Z < 1.96)
-@
-<<>>=
-# value which satisfies P(Z < z) = 0.975
-qnorm(.975, mean=0, sd=1)
-integrate(dnorm, -Inf, 0)    # P(Z < 0)
-@
-The following table displays the basenames for the probability distributions
-available within base \R. These functions can be prefixed by {\tt d} to
-find the density function for the distribution, {\tt p} to find the
-cumulative distribution function, {\tt q} to find quantiles, and {\tt r} to
-generate random draws. For example, to find the density function of a binomial
-random variable, use the command \function{dbinom()}.
-The \function{qDIST()} function is the inverse of the
-\function{pDIST()} function, for a given basename {\tt DIST}.
-\begin{center}
-\begin{tabular}{|c|c|} \hline
-Distribution & Basename \\ \hline
-Beta & {\tt beta} \\
-binomial & {\tt binom} \\
-Cauchy & {\tt cauchy} \\
-chi-square & {\tt chisq} \\
-exponential & {\tt exp} \\
-F & {\tt f} \\
-gamma & {\tt gamma} \\
-geometric & {\tt geom} \\
-hypergeometric & {\tt hyper} \\
-logistic & {\tt logis} \\
-lognormal & {\tt lnorm} \\
-negative binomial & {\tt nbinom} \\
-normal & {\tt norm} \\
-Poisson & {\tt pois} \\
-Student's t & {\tt t} \\
-Uniform & {\tt unif} \\
-Weibull & {\tt weibull} \\ \hline
-\end{tabular}
-\end{center}
-\DiggingDeeper{The \function{fitdistr()} function within the \pkg{MASS} package facilitates estimation
-of parameters for many distributions.}
-The \function{plotDist()} function can be used to display distributions in a variety of ways.
-<<>>=
-plotDist('norm', params=list(mean=100, sd=10),
-  kind='cdf')
-@
-<<>>=
-plotDist('exp', kind='histogram')
-@
-<<>>=
-plotDist('binom', params=list(size=25, prob=0.25),
-  xlim=c(-1,26))
-@
-\begin{problem}
-Generate a sample of 1000 exponential random variables with rate parameter
-equal to 2, and calculate the mean of those variables.
-\end{problem}
-\begin{solution}
-<<>>=
-x <- rexp(1000, rate=2)
-mean(x)
-@
-\end{solution}
-
-\begin{problem}
-Find the median of the random variable $X$, if it is exponentially distributed
-with rate parameter 10.
-\end{problem}
-\begin{solution}
-<<>>=
-qexp(.5, rate=10)
-@
-\end{solution}
-
-
-\chapter{Power Calculations}
-\label{chap:onesamppower}
-
-While not generally a major topic in introductory courses, power and sample size calculations
-help to reinforce key ideas in statistics. In this section, we will explore how \R\ can
-be used to undertake power calculations using analytic approaches.
-We consider a simple problem with two tests (a t-test and a sign test) of
-a one-sided comparison.
-
-Let $X_1, \ldots, X_{25}$ be i.i.d.\ $N(0.3, 1)$ (this is the alternative under which we wish to calculate power). Consider testing the null hypothesis $H_0: \mu=0$ versus $H_A: \mu>0$ at significance level $\alpha=.05$. We will compare the power of the sign test and the power of the test based on normal theory (a one-sample, one-sided t-test) assuming that $\sigma$
-is known.
-
-\section{Sign test}
-
-We start by calculating the Type I error rate for the sign test. Here we want to
-reject when the number of positive values is large. Under the null hypothesis, this is
-distributed as a binomial random variable with $n=25$ trials and probability $p=0.5$ of being
-a positive value. Let's consider values between 15 and 19.
-<<>>=
-qbinom(.95, size=25, prob=0.5)
-xvals <- 15:19
-probs <- 1 - pbinom(xvals, size=25, prob=0.5)
-cbind(xvals, probs)
-@
-So we see that if we decide to reject when the number of positive values is
-17 or larger, we will have an $\alpha$ level of \Sexpr{round(1-pbinom(16, 25, 0.5), 3)},
-which is near the nominal value in the problem.
-
-We calculate the power of the sign test as follows. The probability that $X > 0$, given that $H_A$ is true, is:
-<<>>=
-1 - pnorm(0, mean=0.3, sd=1)
-@
-We can view this graphically using the command:
-\begin{center}
-<<>>=
-xpnorm(0, mean=0.3, sd=1, lower.tail=FALSE)
-@
-\end{center}
-The power under the alternative is equal to the probability of getting 17 or more positive values,
-given that $p=0.6179$:
-<<>>=
-1 - pbinom(16, size=25, prob=0.6179)
-@
-The power is modest at best.
-
-\section{T-test}
-
-We next calculate the power of the test based on normal theory. To keep the comparison
-fair, we will set our $\alpha$ level equal to 0.05388.
-First we find the rejection region.
-
-<<>>=
-alpha <- 1-pbinom(16, size=25, prob=0.5); alpha
-n <- 25; sigma <- 1   # given
-stderr <- sigma/sqrt(n)
-zstar <- qnorm(1-alpha, mean=0, sd=1)
-zstar
-crit <- zstar*stderr
-crit
-@
-Therefore, we reject for observed means greater than \Sexpr{round(crit,3)}.
-
-To calculate the power of this one-sided test we find the probability
-under the alternative hypothesis
-to the right of this cutoff:
-<<>>=
-power <- 1 - pnorm(crit, mean=0.3, sd=stderr)
-power
-@
-Thus, the power of the test based on normal theory is \Sexpr{round(power,3)}.
-To provide a check (or for future calculations of this sort) we can use the
-\function{power.t.test()} function.
-<<>>=
-power.t.test(n=25, delta=.3, sd=1, sig.level=alpha, alternative="one.sided",
-  type="one.sample")$power
-@
-
-This yields a similar estimate to the value that we calculated directly.
-Overall, we see that the t-test has higher power than the sign test, if the underlying
-data are truly normal.
\TeachingTip{It's useful to have students calculate power empirically,
-to demonstrate the power of simulations.}
-\begin{problem}
-\label{prob:power1}%
-Find the power of a two-sided two-sample t-test where both distributions
-are approximately normally distributed with the same standard deviation, but the groups differ by 50\% of the standard deviation. Assume that there are
-\Sexpr{n}
-observations per group and an alpha level of \Sexpr{alpha}.
-\end{problem}
-\begin{solution}
-<<>>=
-n <- 100
-alpha <- 0.01
-@
-<<>>=
-n
-alpha
-power.t.test(n=n, delta=.5, sd=1, sig.level=alpha)
-@
-\end{solution}
-\begin{problem}
-Find the sample size needed to have 90\% power for a two-group t-test
-where the true
-difference between means is 25\% of the standard deviation in the groups
-(with $\alpha=0.05$).
-\end{problem}
-\begin{solution}
-<<>>=
-power.t.test(delta=.25, sd=1, sig.level=0.05, power=0.90)
-@
-\end{solution}
-
-
-\chapter{Data Management}
-\label{sec:manipulatingData}%
-
-Data management is a key capacity that allows students (and instructors) to ``compute with data'' or,
-as Diane Lambert has stated, to ``think with data''.
-We tend to keep student data management to a minimum during the early part of an introductory
-statistics course, then gradually introduce topics as needed. For courses where students
-undertake substantive projects, data management is more important. This chapter describes
-some key data management tasks.
-
-\section{Adding new variables to a data frame}
-We can add additional variables to an existing data frame by simple assignment.
-
-<<>>=
-head(iris)
-@
-
-<<>>=
-# cut() places data into bins
-iris <- transform(iris, Length = cut(Sepal.Length, 4:8))
-@
-
-<<"mr-adding-variable2-again">>=
-head(iris)
-@
-%\Rindex{summary()}
-
-
-The \dataframe{CPS85} data frame contains data from a Current Population Survey (current in 1985, that is).
-Two of the variables in this data frame are \variable{age} and \variable{educ}.
We can estimate
-the number of years a worker has been in the workforce if we assume they have been in the workforce
-since completing their education and that their age at graduation is 6 more than the number
-of years of education obtained. We can add this as a new variable in the data frame simply
-by assigning to it:
-<<>>=
-CPS85 <- transform(CPS85, workforce.years = age - 6 - educ)
-favstats(~ workforce.years, data=CPS85)
-@
-In fact, this is what was done for all but one of the cases to create the \variable{exper}
-variable that is already in the \dataframe{CPS85} data.
-<<>>=
-with(CPS85, table(exper - workforce.years))
-@
-
-\section{Dropping variables}
-Since we already have \variable{educ}, there is no reason to keep our new variable. Let's drop it.
-Notice the clever use of the minus sign.
-<<>>=
-names(CPS85)
-CPS1 <- subset(CPS85, select = -workforce.years)
-names(CPS1)
-@
-Any number of variables can be dropped or kept in this manner by supplying a vector
-of variable names.
-<<>>=
-CPS1 <- subset(CPS85, select = -c(workforce.years, exper))
-@
-
-If we only want to work with the first few variables, we can discard the rest in a similar way.
-Columns can be specified by number as well as name (but this can be dangerous if you are wrong
-about where the columns are):
-<<>>=
-CPSsmall <- subset(CPS85, select=1:4)
-head(CPSsmall, 2)
-@
-
-\section{Renaming variables}
-Both the column (variable) names and the row names of a data frame can be changed by
-simple assignment using \function{names()} or \function{row.names()}.
-<<>>=
-ddd <- data.frame(number=1:5, letter=letters[1:5])
-row.names(ddd) <- c("Abe","Betty","Claire","Don","Ethel")
-ddd   # row.names affects how a data.frame prints
-@
-More interestingly, it is possible to reset just individual names with the following
-syntax.
-<<>>=
-# misspelled a name, let's fix it
-row.names(ddd)[2] <- "Bette"
-row.names(ddd)
-@
-
-The \dataframe{faithful} data set (in the \pkg{datasets} package, which is always available)
-has very unfortunate names.
-\TeachingTip{It's a good idea to start teaching good practices for choosing variable names from day one.}
-<<>>=
-names(faithful)
-@
-The measurements are the duration of an eruption and the time until the subsequent eruption,
-so let's give them some better names.
-<<>>=
-names(faithful) <- c('duration', 'time.til.next')
-head(faithful, 3)
-@
-\begin{center}
-<<"mr-faithful-xy">>=
-xyplot(time.til.next ~ duration, alpha=0.5, data=faithful)
-@
-\end{center}
-If the variable containing a data frame is modified or used to store a different object,
-the original data from the package can be recovered using \function{data()}.
-<<>>=
-data(faithful)
-head(faithful, 3)
-@
-
-\begin{problem}
-Using the \dataframe{faithful} data frame, make a scatterplot of eruption duration times vs.\ the time
-since the previous eruption.
-\end{problem}
-
-If we want to rename a variable, we can do this using \function{names()}.
-For example, perhaps we want to rename \variable{educ} (the second column) to \variable{education}.
-<<>>=
-names(CPS85)[2] <- 'education'
-CPS85[1, 1:4]
-@
-
-If we don't know the column number (or generally to make our code clearer), a few more
-keystrokes produce:
-<<>>=
-names(CPS85)[names(CPS85) == 'education'] <- 'educ'
-CPS85[1, 1:4]
-@
-
-\section{Creating subsets}
-\label{sec:subsets}
-We can also use \function{subset()} to reduce the size of a data set by selecting
-only certain rows.
-\begin{center} -<<"mr-faithful-long-xy">>= -data(faithful) -names(faithful) <- c('duration', 'time.til.next') -# any logical can be used to create subsets -faithfulLong <- subset(faithful, duration > 3) -xyplot( time.til.next ~ duration, data=faithfulLong ) -@ -\end{center} - -Of course, if all we want to do is produce a graph, there is no reason to create -a new data frame. The plot above could also be made with: -\Caution{Unfortunately, not all functions in R support the \function{subset=} or -\option{data=} options.} -<>= -xyplot(time.til.next ~ duration, subset=duration > 3, - data=faithful) -@ - - - -\section{Sorting data frames} - -Data frames can be sorted using the \function{order()} function. -<<>>= -head(faithful, 3) -sorted = faithful[order(faithful$duration),] -head(sorted, 3) -@ - - - -\section{Merging datasets} - -The \dataframe{fusion1} data frame in the \pkg{fastR} package contains -genotype information for a SNP (single nucleotide polymorphism) in the gene -\emph{TCF7L2}. -The \dataframe{pheno} data frame contains phenotypes -(including type 2 diabetes case/control status) for an intersecting set of individuals. -We can merge these together to explore the association between -genotypes and phenotypes using \verb!merge()!. - -%\Rindex{merge()}% -<<>>= -require(fastR) -fusion1 = fusion1[order(fusion1$id),] -head(fusion1,3) -head(pheno,3) -@ - -<>= -# merge fusion1 and pheno keeping only id's that are in both -fusion1m <- merge(fusion1, pheno, by.x='id', by.y='id', - all.x=FALSE, all.y=FALSE) -head(fusion1m, 3) -@ -In this case, since the values are the same for each data frame, we could collapse -\option{by.x} and \option{by.y} to \option{by} and collapse -\option{all.x} and \option{all.y} to \option{all}. -The first of these specifies which column(s) to use to identify matching cases. 
-The second indicates whether cases in one data frame that do not appear in the other
-should be kept in the merged data frame (\code{TRUE}, with \code{NA} filled in as needed)
-or dropped (\code{FALSE}).
-
-Now we are ready to begin our analysis.
-<<"mr-fusion1-xtabs">>=
-tally(~ t2d + genotype, data=fusion1m)
-@
-
-\begin{problem}
-The \dataframe{fusion2} data set in the \pkg{fastR} package contains genotypes for
-another SNP. Merge \dataframe{fusion1}, \dataframe{fusion2}, and \dataframe{pheno} into a single data
-frame.
-
-Note that \dataframe{fusion1} and \dataframe{fusion2} have the same columns.
-<<>>=
-names(fusion1)
-names(fusion2)
-@
-You may want to use the \option{suffixes} argument to \function{merge()} or rename the variables
-after you are done merging to make the resulting data frame easier to navigate.
-
-Tidy up your data frame by dropping any columns that are redundant or that you just don't want to
-have in your final data frame.
-\end{problem}
-
-\section{Slicing and dicing}
-
-The \function{reshape()} function provides a flexible way to change the arrangement of data.
-\Rindex{reshape()}%
-It was designed for converting between long and wide versions of
-time series data and its arguments are named with that in mind.
-
-A common situation is when we want to convert from a wide form to a
-long form because of a change in perspective about what a unit of
-observation is. For example, in the \dataframe{traffic} data frame, each
-row is a year, and data for multiple states are provided.
-
-<<"mr-traffic-reshape">>=
-traffic
-@
-We can reformat this so that each row contains a measurement for a
-single state in one year.
-
-<<>>=
-longTraffic <-
-  reshape(traffic[,-2], idvar="year",
-    ids=row.names(traffic),
-    times=names(traffic)[3:6],
-    timevar="state",
-    varying=list(names(traffic)[3:6]),
-    v.names="deathRate",
-    direction="long")
-head(longTraffic)
-@
-And now we can reformat the other way, this time having all data for a given state
-form a row in the data frame.
-<<>>=
-stateTraffic <- reshape(longTraffic, direction='wide',
-  v.names="deathRate", idvar="state", timevar="year")
-stateTraffic
-@
-
-In simpler cases, \function{stack()} or \function{unstack()} may suffice.
-The \pkg{Hmisc} package also provides \function{reShape()} as an alternative
-to \function{reshape()}.
-\Rindex{stack()}%
-\Rindex{unstack()}%
-
-\section{Derived variable creation}
-
-A number of functions help facilitate the creation or recoding of variables.
-
-\subsection{Creating a categorical variable from a quantitative variable}
-
-Next we demonstrate how to
-create a three-level categorical variable
-with cuts at 20 and 40 for the CESD scale (which ranges from 0 to 60 points).
-<<>>=
-favstats(~ cesd, data=HELPrct)
-HELPrct = transform(HELPrct, cesdcut = cut(cesd,
-  breaks=c(0, 20, 40, 60), include.lowest=TRUE))
-bwplot(cesd ~ cesdcut, data=HELPrct)
-@
-
-\subsection{Reordering factors}
-By default, \R\ uses the first level in lexicographic order as the reference group for modeling. This
-can be overridden using the \function{relevel()} function (see also \function{reorder()}).
-<<>>=
-tally(~ substance, data=HELPrct)
-coef(lm(cesd ~ substance, data=HELPrct))
-HELPrct = transform(HELPrct, subnew = relevel(substance,
-  ref="heroin"))
-coef(lm(cesd ~ subnew, data=HELPrct))
-@
-
-\section{Accounting for missing data}
-\label{sec:miss}
-
-Missing values arise in almost all real-world investigations. \R\ uses \code{NA} to
-indicate missing data.
The \dataframe{HELPmiss} data frame within the \pkg{mosaic} package includes all
-$n=470$ subjects enrolled at baseline (including the $n=17$ subjects with some missing data who
-were not included in \dataframe{HELPrct}).
-<<>>=
-smaller = subset(HELPmiss, select=c("cesd", "drugrisk",
-  "indtot", "mcs", "pcs", "substance"))
-dim(smaller)
-summary(smaller)
-favstats(~ mcs, data=smaller)
-with(smaller, sum(is.na(mcs)))
-nomiss = na.omit(smaller)
-dim(nomiss)
-favstats(~ mcs, data=nomiss)
-@
-
-\chapter{Health Evaluation and Linkage to Primary Care (HELP) Study}
-
-\label{sec:help}
-
-Many of the examples in this guide utilize data from the HELP study,
-a randomized clinical trial for adult inpatients recruited from a detoxification unit.
-Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care,
-with the goal of linking them to primary medical care.
-Funding for the HELP study was provided by the National Institute
-on Alcohol Abuse and Alcoholism (R01-AA10870, Samet PI) and the
-National Institute on Drug Abuse (R01-DA10019, Samet PI).
-The details of the
-randomized trial along with the results from a series of additional analyses have been published\cite{same:lars:hort:2003,lieb:save:2002,kert:hort:frie:2003}.
-
-Eligible subjects were
-adults who spoke Spanish or English, reported alcohol, heroin, or
-cocaine as their first or second drug of choice, resided in proximity
-to the primary care clinic to which they would be referred, or were
-homeless. Patients with established primary care relationships
-they planned to continue, significant dementia, specific plans to
-leave the Boston area that would prevent research participation,
-failure to provide contact information for tracking purposes, or
-pregnancy were excluded.
-
-Subjects were interviewed at baseline during
-their detoxification stay and follow-up interviews were undertaken
-every 6 months for 2 years.
A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.
-The Institutional Review Board of
-Boston University Medical Center approved all aspects of the study, including the creation of the de-identified dataset. Additional
-privacy protection was secured by the issuance of a Certificate of
-Confidentiality by the Department of Health and Human Services.
-
-The \pkg{mosaic} package contains several forms of the de-identified HELP dataset.
-We will focus on \dataframe{HELPrct}, which contains
-27 variables for the 453 subjects
-with minimal missing data, primarily at baseline.
-Variables included in the HELP dataset are described in Table \ref{tab:helpvars}. More information can be found elsewhere\cite{Horton:2011:R}.
-A copy of the study instruments can be found at: \url{http://www.amherst.edu/~nhorton/help}.
-\begin{longtable}{|p{2.1cm}|p{6.8cm}|p{3.5cm}|}
-\caption{Annotated description of variables in the \dataframe{HELPrct} dataset}
-\label{tab:helpvars} \\
-\hline
-VARIABLE & DESCRIPTION (VALUES) & NOTE \\ \hline
-\variable{age} & age at baseline (in years) (range 19--60) & \\ \hline
-\variable{anysub} & use of any substance post-detox & see also \variable{daysanysub}
-\\ \hline
-\variable{cesd} & Center for Epidemiologic Studies Depression scale (range 0--60, higher scores indicate more depressive symptoms) & \\ \hline
-\variable{d1} & how many times hospitalized for medical problems (lifetime) (range 0--100) & \\ \hline
-\variable{daysanysub} & time (in days) to first use of any substance post-detox (range 0--268) & see also \variable{anysubstatus} \\ \hline
-\variable{dayslink} & time (in days) to linkage to primary care (range 0--456) & see also \variable{linkstatus}
-\\ \hline
-\variable{drugrisk} & Risk-Assessment Battery (RAB) drug risk score (range 0--21) & see also \variable{sexrisk}
-\\ \hline
-\variable{e2b} & number of times in past 6 months entered a detox program (range 1--21) & \\
\hline
-\variable{female} & gender of respondent (0=male, 1=female) &
-\\ \hline
-\variable{g1b} & experienced serious thoughts of suicide (last 30 days, values 0=no, 1=yes) &
-\\ \hline
-\variable{homeless} & 1 or more nights on the street or in a shelter in past 6 months (0=no, 1=yes) &
-\\ \hline
-\variable{i1} & average number of drinks (standard units) consumed per day (in the past 30 days, range 0--142) & see also \variable{i2}
-\\ \hline
-\variable{i2} & maximum number of drinks (standard units) consumed per day (in the past 30 days, range 0--184) & see also \variable{i1}
-\\ \hline
-\variable{id} & random subject identifier (range 1--470) &
-\\ \hline
-\variable{indtot} & Inventory of Drug Use Consequences (InDUC) total score (range 4--45) &
-\\ \hline
-\variable{linkstatus} & post-detox linkage to primary care (0=no, 1=yes) & see also \variable{dayslink}
-\\ \hline
-\variable{mcs} & SF-36 Mental Component Score (range 7--62, higher scores are better) & see also \variable{pcs}
-\\ \hline
-\variable{pcs} & SF-36 Physical Component Score (range 14--75, higher scores are better) & see also \variable{mcs}
-\\ \hline
-\variable{pss\_fr} & perceived social supports (friends, range 0--14) &
-\\ \hline
-\variable{racegrp} & race/ethnicity (black, white, hispanic or other) & \\ \hline
-\variable{satreat} & any BSAS substance abuse treatment at baseline (0=no, 1=yes) & \\ \hline
-\variable{sex} & sex of respondent (male or female) & \\ \hline
-\variable{sexrisk} & Risk-Assessment Battery (RAB) sex risk score (range 0--21) & see also \variable{drugrisk}
-\\ \hline
-\variable{substance} & primary substance of abuse (alcohol, cocaine or heroin) &
-\\ \hline
-\variable{treat} & randomization group (randomized to HELP clinic, no or yes) &
-\\ \hline
-\end{longtable}
-\noindent
-Notes: Observed range is provided (at baseline) for continuous variables.
-
-
-\chapter{Exercises and Problems}
-
-\shipoutProblems
-
-\bibliographystyle{alpha}
-\bibliography{../include/USCOTS}
diff --git a/Interweb/Master-Internet.Rnw b/Interweb/Master-Internet.Rnw
deleted file mode 100644
index 28d18d8..0000000
--- a/Interweb/Master-Internet.Rnw
+++ /dev/null
@@ -1,41 +0,0 @@
-
-
-\documentclass[open-any,12pt]{tufte-book}
-
-\usepackage{../include/RBook}
-\usepackage{pdfpages}
-%\usepackage[shownotes]{authNote}
-\usepackage[hidenotes]{authNote}
-
-\def\tilde{\texttt{\~}}
-
-\title{Mining the Internet to Teach Statistics with R}
-\author{Randall J.\,Pruim, Nicholas J.\,Horton, and\\Daniel T.\,Kaplan}
-\date{DRAFT: \today}
-
-\renewenvironment{knitrout}{\relax}{\noindent}
-\begin{document}
-
-
-%\maketitle
-
-\includepdf{USCOTS-cover}
-
-\newpage
-
-\tableofcontents
-
-\newpage
-
-<<>>=
-..makingMaster.. <- TRUE
-@
-
-<<>>=
-@
-
-<<>>=
-@
-
-
-\end{document}
diff --git a/Interweb/internet-cover.pptx b/Interweb/internet-cover.pptx
deleted file mode 100644
index 7206c23..0000000
Binary files a/Interweb/internet-cover.pptx and /dev/null differ
diff --git a/Interweb/makefile b/Interweb/makefile
deleted file mode 100644
index d367dc2..0000000
--- a/Interweb/makefile
+++ /dev/null
@@ -1,11 +0,0 @@
-all: Master-Internet.pdf
-
-Master-Internet.pdf: Master-Internet.tex
-	pdflatex Master-Internet
-	bibtex Master-Internet
-	pdflatex Master-Internet
-
-Master-Internet.tex: Master-Internet.Rnw Internet.Rnw
-	knitr Master-Internet.Rnw
-
-
diff --git a/Materials/Housing/UntanglingHousesInstructorNotes.Rmd b/Materials/Housing/UntanglingHousesInstructorNotes.Rmd
deleted file mode 100644
index 67c324a..0000000
--- a/Materials/Housing/UntanglingHousesInstructorNotes.Rmd
+++ /dev/null
@@ -1,63 +0,0 @@
-Instructor Notes: Untangling House Prices
-========================================================
-
-```{r error=FALSE,warning=FALSE,message=FALSE,results="hide",echo=FALSE}
-require(mosaic,quietly=TRUE)
-trellis.par.set(theme=col.mosaic())
-```
-
-The [student
handout](UntanglingHousePrices.pdf) contains some background on the data. That handout is presented as a contrast between analysis using one explanatory variable at a time and analysis using more than one explanatory variable simultaneously.
-
-Students do not yet have the theory that they will need to understand how and why fitting a model with more than one explanatory variable simultaneously works. This activity is meant to set up the question, "Why should I believe one result rather than another?" In the student handout, that question is addressed by an appeal to authority: look to see what realtors and developers have to say about the value of a fireplace. Not very satisfactory, scientifically. After all, how do they know?
-
-There are some other aspects of statistical modeling that the housing price setting can illuminate:
-* Dealing with non-normal distributions by transformation
-* "Untangling" of influences
-* Interaction terms
-
-Non-Normal Distributions
------------------------
-The price (and living area) variables have a right-skewed form. In order to avoid undue influence of the cases in the tails, and in order to render price into a more meaningful form, a log transformation can be appropriate.
-
-In addition to changing the shape of the distribution, the log transformation changes the meaning of price into a proportional variable --- an increase in log-price by one unit corresponds to a constant proportional change in price rather than a constant dollar change in price.
-
-```{r}
-houses = fetchData("SaratogaHouses.csv")
-densityplot( ~ Price, data=houses )
-densityplot( ~ log(Price), data=houses )
-```
-
-Untangling
-----------
-
-The figure shows a simple, theoretical and schematic diagram of some of the factors influencing house prices.
-
-![House Price Influences](HousePriceInfluences.png)
-
-The quantities measured in the Saratoga house price dataset are drawn in boxes. Other factors, not measured, are shown in ovals.
- -The idea behind this diagram is that there are two background influences at work. (Of course, there may be many more! Make your own theory!) The background influences in this theory are -* Family Size -* Wealth - -These two factors may or may not be connected, depending on whether you think family size influences wealth and/or vice versa. That possible connection is shown by a double-headed arrow with question marks. - -The other arrows show other, plausible connections. -* A bigger family wants more living space and more bedrooms and baths. -* A wealthier family wants more of those things too. But they also want a higher "quality" house, including for example a fireplace. - -The immediate question at hand in the student handout is how the presence of a fireplace influences price. It seems obvious that the way to study this is to examine the relationship between "Fireplace" and "Price", the arrow marked as "Question at Hand." - -Note that there are many pathways between "Fireplace" and "Price": quite a tangled network of influences. When you examine the relationship between two variables, you are studying all of the possible paths. (The analysis of such paths, and their use to select covariates in a model, is the subject of Chapter 17.) In particular, many of the paths from fireplace to price go through "Wealth" and "Quality". If we want to study the influence of a fireplace on price, we need to untangle these various influences. - -Interaction ------------ - -There's a temptation when building a model for an untrained audience to stick to a "main effects" form, without interaction terms. Such a model makes it easy to say, "This is the effect of variable A, and this is the effect of variable B." In the house price example, this style shows up when saying, "A second bathroom is worth this much, an additional bedroom is worth this much, and an additional square foot of living area is worth this much." 
That may well be a fine approximation for the purposes of conveying information simply. - -In the house price example, however, there is a strong potential for an interaction. The number of bedrooms is not independent of the living area. You can add a bedroom to a house by taking away some of the common living space. Or you can add a bedroom by dividing in half an existing bedroom. Neither of these approaches to adding a bedroom could be expected to increase the price of a house as much as adding a bedroom that contains additional space. - -This suggests that there should be an interaction between living space and number of bedrooms: the effect on price of an additional bedroom depends on how much living space there is. - -You can explore this with the `vLM` program, watching how $R^2$ changes as you include and exclude the interaction terms. - diff --git a/Materials/Housing/poly2d.tex b/Materials/Housing/poly2d.tex deleted file mode 100644 index 5e7bba3..0000000 --- a/Materials/Housing/poly2d.tex +++ /dev/null @@ -1,149 +0,0 @@ -Our general purpose tool for constructing local models of functions of two -variables is the polynomial. The point of constructing such models -isn't to capture exactly every aspect of the relationship, but to -build a scaffolding that can be used to analyze and interpret data, -hopefully leading to a better description of the relationship. - -We imagine that there is an output that is a function of two inputs: -$f(x,y)$. The polynomial function that we will use will be -$$ f(x,y) = a_0 + a_1 x + a_2 y + a_3 x y + a_4 x^2 + a_5 y^2$$ - -Note that the six parameters have been subscripted with a number. This is just for -convenience in referring to them. You can call them ``a naught'', ``a -one,'' ``a two'', and so on. Whatever the names, each of them is just -a scalar. - -Depending on the values of the parameters $a_0, a_1, a_2, a_3, a_4, -a_5$, this function can take on all sorts of shapes. 
But, in general,
-{\bf not all of the terms are needed}.
-
-
-
-\begin{description}
-
-\item[$a_0$] The {\bf constant term}. This sets a typical value of
-  $f(x,y)$, but doesn't depend on either $x$ or $y$. It is almost
-  always included by default.
-
-
-\item[$a_1 x$] The {\bf linear term in $x$}.
-  Produces a simple dependence on the input $x$; if the
-  input $x$ changes, then the output $f(x,y)$ will change.
-
-\item[$a_2 y$] Likewise, the {\bf linear term in $y$}. This produces a simple dependence on the input $y$.
-
-\item[$a_4 x^2$] The {\bf quadratic term in $x$} can do two things. It is
-  absolutely needed in the model if there is a maximum or minimum with
-  respect to $x$. But, even if there is no extremum, if there is an
-  important change in $\frac{\Delta f}{\Delta x}$ as $x$ changes, then
-  there should be this quadratic term. Example: economists often
-  speak of diminishing marginal returns --- doubling the amount of
-  investment doesn't lead to a doubling in output.
-
-\item[$a_5 y^2$] The {\bf quadratic term in $y$}. Like the quadratic term in $x$, it's needed for there to be an extremum with
-  respect to $y$, or a change in $\frac{\Delta f}{\Delta y}$.
-
-\item[$a_3 x y$] The {\bf interaction term}. This term expresses how the
-  inputs $x$ and $y$ interact: perhaps interfering with one another or
-  reinforcing one another. Whenever the output depends on $x$
-  differently for different values of $y$, or vice versa, there should
-  be an interaction term included in the model.
-
-\end{description}
-
-
-Almost always, we include the constant and linear terms in a model,
-although we might discover that they are not needed if other terms are
-added. The question is generally whether to include the quadratic and
-bilinear terms.
- -In order to decide which of these terms to include in a model -$f(x,y)$, it helps to ask the following questions about the quadratic terms and interaction terms: - -\begin{enumerate} - -\item Is there an extremum with respect to $x$? That is, holding $y$ - fixed, is there a value of $x$ at which $f(x,y)$ takes on a maximum - or minimum value? If there is, you will want to include the - quadratic term in $x$. - - -\item If there is an extremum with respect to $x$, does its position - or magnitude depend on the value of $y$? If so, include the - interaction term. - -\item If there isn't an extremum with respect to $x$, does the slope - with respect to $x$ depend on $y$? If so, include the interaction term even though there isn't a quadratic term in $x$. - -\item The same questions should be asked with respect to $y$ to decide whether to include the quadratic term in $y$. - -\item Both $x$ and $y$ participate in the interaction term, but sometimes one of the variables gives you a clearer indication that an interaction is important. Include it if warranted for {\bf either} of the variables $x$ and $y$. - - - - - -\end{enumerate} - - -Decide which terms should be included in local models in these -situations: - -\begin{description} - -\item[Bicycle speed] A bicycle's speed $V$ depends on both the - steepness $S$ of the terrain and the gear ratio $G$ for the bicycle. - Assume that the gear ratio is a number between 1 and 6, and let the - steepness be measured in percent (positive for uphill, negative for - downhill). What terms should be included in $V(S,G)$? - -\item[Economic production] The output of a factory, $P$, depends both - on the amount of capital $C$ and the amount of labor $L$. What - terms should be included in $P(C,L)$? - -\item[Infectious disease] The number of people $N$ who get an illness such - as the flu depends on both the number of people who already have the - illness $I$, and the number who are susceptible $S$. 
What terms
-  should be included in $N(S,I)$?
-
-\item[Survival of chicks] The number of surviving fledglings $F$ of a
-  mother bird depends on the number of eggs $N$ that are laid and the time
-  that the mother spends collecting food $T$. What terms should be
-  included in $F(N,T)$?
-
-\item[Day length] The length of daylight $D$ depends on both the time
-  of the year $M$ (for month) and the latitude $L$. What terms should
-  be included in $D(M,L)$?
-
-\item[Growth of a crop] The yield $Y$ of a crop (bushels/acre) depends
-  both on the amount of water applied ($W$, inches/acre) and the amount of
-  fertilizer ($F$, lbs/acre). What terms should be included in
-  $Y(W,F)$?
-
-\item[Probability of admission to college] The probability $P$ that an
-  applicant will be admitted to college depends on many things, but we
-  will restrict consideration here to the math $M$ and verbal $V$
-  scores on an entrance examination such as the ACT or SAT. What
-  terms should be included in $P(M,V)$?
-
-\item[School effectiveness] The effectiveness $E$ of an elementary
-  school (perhaps as measured imperfectly by standardized tests)
-  depends on both the qualifications of the teachers and the class
-  size $S$. We'll crassly measure the across-the-district teacher
-  qualification with the average teacher pay, $P$. What terms should
-  be included in $E(S,P)$?
-
-\end{description}
-
-We can generalize this approach to more than two variables. Here is a
-function of (at least) three variables.
-
-\begin{description}
-
-\item[Probability of a heart attack] The probability of a heart attack
-  $p$ as a function of age $A$, amount of exercise $E$, and number
-  of calories in the diet $C$.
What terms should be included in $p(A,E,C)$? - -\end{description} - diff --git a/Modeling/Master-Modeling.Rnw b/Modeling/MOSAIC-Modeling.Rnw similarity index 100% rename from Modeling/Master-Modeling.Rnw rename to Modeling/MOSAIC-Modeling.Rnw diff --git a/ModelingV2/Master-Modeling.Rnw b/ModelingV2/MOSAIC-Modeling.Rnw similarity index 100% rename from ModelingV2/Master-Modeling.Rnw rename to ModelingV2/MOSAIC-Modeling.Rnw diff --git a/README.Rmd b/README.Rmd index 5500afb..132d1bf 100644 --- a/README.Rmd +++ b/README.Rmd @@ -1,152 +1,37 @@ -TeachStatsWithR +--- +--- +Project MOSAIC Little Books =============== ```{r include=FALSE} require(mosaic) ``` -Materials for the MOSAIC "Teaching Statistics with R and RStudio" +You can grab PDFs of the Little Books here: -### Compiling - -The content of each book is in a separate directory. That directory has a subdirectory, `Master`, with a file `Master-*.Rnw`. - -Each of the `.Rnw` files in the content directory can be compiled on its own. Just "Knit HTML" in RStudio. This will create a PDF file. - -To create the whole book, you need to recompile each `.Rnw` after setting a variable, `notAsStandAlone=TRUE`. The process is -```{r eval=FALSE} -require(knitr) -# Change to the directory, e.g. "Starting" -setwd("Starting")  # go to your own directory -standAlone=TRUE -fnames <- list.files(pattern="*.Rnw$") # files to recompile -for (fnm in fnames) knit(fnm) -``` - -### Outline - -This project consists of several short books that are inter-related. - -1. Start Teaching with R - Directory: `Starting` -2. The Core of a Traditional Course - Directory: `Traditional` -3. Simulation-Based Inference - Directory: `Simulation` -4. Functions and Formulas - Directory: `Functions` -5. Teaching with Internet Services - Directory: `Internet` -6. 
Start with Modeling - Directory: `Modeling` + * *Start Teaching Statistics Using R* + [[view]](Starting/MOSAIC-StartTeaching.pdf) + [[download]](../../raw/master/Starting/MOSAIC-StartTeaching.pdf) + + This book presents instructors with an overview of our approach to + teaching statistics with R and an introduction to our primary R toolkit. + + * *A Student's Guide to R* + [[view]](StudentGuide/MOSAIC-StudentGuide.pdf) + [[download]](../../raw/master/StudentGuide/MOSAIC-StudentGuide.pdf) + + This book is organized by analysis method and demonstrates how to perform + all of the statistical analyses typically covered in an Intro Stats course. + It can serve as a good reference for both students and faculty. -APPENDICES - -A. Style instructions for authors -B. Possible additional topics - -Outlines for the individual chapters are in the `Outline.Rmd` file in each directory. + This was formerly known as *A Compendium of Commands to Teach Statistics With R* but has been reworked to make it more student-friendly. -## Overview - -Some general comments about the project as a whole. - -```{r starting-outline,child='Starting/Outline-Starting.Rmd',eval=TRUE} ``` - -```{r traditional-outline,child='Traditional/Outline-Traditional.Rmd',eval=TRUE} ``` - -```{r simulation-outline,child='Simulation/Outline-Simulation.Rmd',eval=TRUE} ``` - -```{r functions-outline,child='Functions/Outline-Functions.Rmd',eval=TRUE} ``` - -```{r internet-outline,child='Internet/Outline-Internet.Rmd',eval=TRUE} ``` - -```{r modeling-outline,child='Modeling/Outline-Modeling.Rmd',eval=TRUE} ``` - - Appendix A: Style Instructions for Authors ======================= - - Notes for the authors can be included using `\authNote{A note to the authors.}` - -Processed notes for the authors can be hidden using `\authNoted{A noted note to the authors.}` - -## Some Style Guidelines - -1. R Code - 1. Use space after comma in argument lists - 2. No space around = in argument list - 3. 
Use space around operators, `<-` and `->` - 4. Casual comments (no need for caps) - 5. When referring to functions in the text, add empty parens (e.g., `data()`) to make it clear that the object is a function. -2. Exercises - N.B. Some exercises are for instructors, not for students. - 1. Use `\begin{problem} ... \end{problem}` to define problems. - 2. Use `\begin{solution} ... \end{solution}` to define solutions. - -This must be \emph{outside} the `problem` environment and before the definition of the next problem. Put it immediately after `\end{problem}` to avoid confusion. - -* Use `\shipoutProblems` to display all problems queued up since the last `shipoutProblems`. -* Examples - Put within `\begin{example}` and `\end{example}`. We can tweak the formatting later. - -* Marginal Notes - We can place some marginal notes with: - * `\InstructorNote{This is an instructor note.}` - * `\FoodForThought{We can tweak the layout, color, size, etc. later. For now, I'm just using color to distinguish.}` - * `\Caution{This is a caution}` - - -1. Variable names. Often it's nice to distinguish between an actual variable name and a word that might have a similar name, for instance between sex and `sex`. Use the `\VN{sex}` command to accomplish this. -1. Model formulas. Use `\model{A}{B+C}` to generate `A ~ B+C`. Often, you may want to use variable names, for instance `\model{\VN{height}}{\VN{age}+\VN{sex}}` gives `height ~ age + sex`. + * *Start R in Calculus* [[Amazon]](http://www.amazon.com/Start-Calculus-Daniel-T-Kaplan/dp/0983965897) + This book describes the use of R in calculus based on the successful + redesign of the first semester calculus course at Macalester College. 
+ +Others that are in progress include + + * Start Modeling with R - - -### R-forge svn - -The mosaic R-forge repository contains - -* The `.Rnw` files -* The dependencies in `bin/` -* *LaTeX* dependencies in `inputs/` - * `problems.sty` (for problems and solutions) - * `authNote.sty` (for author notes) - * `probstat.sty` (for some prob/stat macros) - * `sfsect.sty` (for sans serif section title fonts) - -You may need to set an environment variable to make *LaTeX* look here. -* the `cache/` and `figures/` directories (so that `make` can be used -without complaint), but *not* their contents, which are generated by `sweave`. -* screenshots and other images not generated by `sweave` are in `images/` - -Appendix B: Possible Additional Topics -===================== - -Do we want to include any of these topics? -* Fancier Lattice Graphics -* Base Graphics -* Making plots with `ggplot2` - Only if one of us knows or wants to learn this system. -* Writing executable R scripts -* R Infrastructure for Teaching - Whatever of this we include might end up in the chapters rather than in -an appendix. -* Sharing in RStudio -* Public Data -* Google Data -* Making Data Available Online -* A Brief Tour of knitr and R-markdown -* exams -* Books - * Our books - * Chance et al (in progress) - * Existing books that work well/poorly with R (and why) -* Online materials diff --git a/README.html b/README.html index 4e7cdd3..52980fb 100644 --- a/README.html +++ b/README.html @@ -1,379 +1,105 @@

  • - + + - - diff --git a/README.md b/README.md index 308a167..a74a618 100644 --- a/README.md +++ b/README.md @@ -1,173 +1,33 @@ -TeachStatsWithR +--- +--- +Project MOSAIC Little Books =============== +You can find PDFs of the Little Books here: -Materials for the MOSAIC "Teaching Statistics with R and RStudio" - -### Compiling - -The content of each book is in a separate directory. That directory has a subdirectory, `Master` with a file `Master-*.Rnw`. - -Each of the `.Rnw` files in the content directory can be compiled on it's own. Just "Knit HTML" in RStudio. This will create a PDF file. - -To create the whole book, you need to recompile each `.Rnw` after setting a variable, `notAsStandAlone=TRUE`. The process is - -```r -require(knitr) -# Change to the directory, e.g. 'Starting' -setwd("Starting") # go to your own directory -standAlone = TRUE -fnames <- list.files(pattern = "*.Rnw$") # files to recompile -for (fnm in fnames) knit(fnm) -``` - - -### Outline - -This project consists of several short books that are inter-related. - -1. Start Teaching with R - Directory: `Starting` -2. The Core of a Traditional Course - Directory: `Traditional` -3. Simulation-Based Inference - Directory: `Simulation` -4. Functions and Formulas - Directory: `Functions` -5. Teaching with Internet Services - Directory: `Internet` -6. Start with Modeling - Directory: `Modeling` + * *Start Teaching Statistics Using R* + [[view]](Starting/MOSAIC-StartTeaching.pdf) + [[download]](../../raw/master/Starting/MOSAIC-StartTeaching.pdf) + + This book presents instructors with an overview of our approach to + teaching statistics with R and an introduction to our primary R toolkit. + + * *A Student's Guide to R* + [[view]](StudentGuide/MOSAIC-StudentGuide.pdf) + [[download]](../../raw/master/StudentGuide/MOSAIC-StudentGuide.pdf) + + This book is organized by analysis method and demonstrates how to perform + all of the statistical analyses typically covered in an Intro Stats course. 
+ It can serve as a good reference for both students and faculty. -APPENDICES - -A. Style instructions for authors -B. Possible additional topics - -Outlines for the individual chapters are in the `Outline.Rmd` file in each directory. - -## Overview - -Some general comments about the project as a whole. - -## Start Teaching with R: Outline - - -Random list of things to include: -* RStudio introduction -* Packages and why to use them -* The mosaic package and others that we recommend -* Minimal R reference -* Rmd -* Umbrella for other books -* Error messages. Things to look for in error messages. - * overwriting names - * recycling vectors - * factors and characters -* Data - * `data()` - * Group data with Google forms - * `fetchData()` and requesting a repository name - * distributing other kinds of files, e.g. templates for .Rmd, scripts, ... -* Style - * Don't use one-letter names in your examples. Your students will pick this up and end up saying things like `c = makeFun(x~x)`, which will mask `base::c()` and mess everything up until you `remove(c)` - - -## R Core for a Traditional Course: Outline - - -## Simulation-Based Inference: Outline - + This was formerly known as *A Compendium of Commands to Teach Statistics With R* but has been reworked to make it more student-friendly. -## Functions and Formulas: Outline - - -## Using Internet Services: Outline - - -## Start Modeling Early: Outline - - - -Appendix A: Style Instructions for Authors -======================= - - -Notes for the authors can be included using `\authNote{A note to the authors.}` - -Processed notes for the authors can be hidden using `\authNoted{A noted note to the authors.}` - -## Some Style Guidelines - -1. R Code - 1. Use space after comma in argument lists - 2. No space around = in argument list - 3. Use space around operators, `<-` and `->` - 4. Casual comments (no need for caps) - 5. 
When referring to functions in the text, add empty parens (e.g., `data()`) to make it clear that the object is a function. -2. Exercises - N.B. Some exercises are for instructors, not for students. - 1. Use `\begin{problem} ... \end{problem}` to define problems. - 2. Use `\begin{solution} ... \end{solution}` to define solutions. - -This must be \emph{outside} the `problem` environment and before the definition of the next problem. Put it immediately after `\end{problem}` to avoid confusion. - -* Use `\shipoutProblems` to display all problems queued up since the last `shipoutProblems`. -* Examples - Put within `\begin{example}` and `\end{example}`. We can tweak the formatting later. - -* Marginal Notes - We can place some marginal notes with: - * `\InstructorNote{This is an instructor note.}` - * `\FoodForThought{We can tweak the layout, color, size, etc. later. For now, I'm just using color to distinguish.}` - * `\Caution{This is a caution}` - - -1. Variable names. Often it's nice to distinguish between an actual variable name and a word that might have a similar name, for instance between sex and `sex`. Use the `\VN{sex}` command to accomplish this. -1. Model formulas. Use `\model{A}{B+C}` to generate `A ~ B+C`. Often, you may want to use variable names, for instance `\model{\VN{height}}{\VN{age}+\VN{sex}}` gives `height ~ age + sex`. + * *Start R in Calculus* [[Amazon]](http://www.amazon.com/Start-Calculus-Daniel-T-Kaplan/dp/0983965897) + This book describes the use of R in calculus based on the successful + redesign of the first semester calculus course at Macalester College. 
+ - - -### R-forge svn - -The mosaic R-forge repository contains - -* The `.Rnw` files -* The dependencies in `bin/` -* *LaTeX* dependencies in `inputs/` - * `problems.sty` (for problems and solutions) - * `authNote.sty` (for author notes) - * `probstat.sty` (for some prob/stat macros) - * `sfsect.sty` (for sans serif section title fonts) - -You may need to set an environment variable to make *LaTeX* look here. -* the `cache/` and `figures/` directories (so that `make` can be used -without complaint), but *not* their contents, which are generated by `sweave`. -* screenshots and other images not generated by `sweave` are in `images/` - -Appendix B: Possible Additional Topics -===================== - -Do we want to include any of these topics? -* Fancier Lattice Graphics -* Base Graphics -* Making plots with `ggplot2` - Only if one of us knows or wants to learn this system. -* Writing executable R scripts -* R Infrastructure for Teaching - Whatever of this we include might end up in the chapters rather than in -an appendix. 
-* Sharing in RStudio -* Public Data -* Google Data -* Making Data Available Online -* A Brief Tour of knitr and R-markdown -* exams -* Books - * Our books - * Chance et al (in progress) - * Existing books that work well/poorly with R (and why) -* Online materials +There is a Spanish-language translation of the *Student's Guide to R* available at https://github.com/jarochoeltrocho/MOSAIC-LittleBooks-Spanish (kudos to Francisco Javier Jara Ávila, https://github.com/jarochoeltrocho). diff --git a/Simulation/SimulationBased.Rnw b/Simulation/MOSAIC-SimulationBased.Rnw similarity index 100% rename from Simulation/SimulationBased.Rnw rename to Simulation/MOSAIC-SimulationBased.Rnw diff --git a/Simulation/Master/.gitignore b/Simulation/Master/.gitignore deleted file mode 100644 index e9701ae..0000000 --- a/Simulation/Master/.gitignore +++ /dev/null @@ -1,7 +0,0 @@ -*.log -*.notes -*.syntex.gz -*.toc -*-concordance.tex -*.synctex.gz -*.tex diff --git a/Simulation/Master/Master-Simulation.Rnw b/Simulation/Master/Master-Simulation.Rnw deleted file mode 100644 index 8d46adb..0000000 --- a/Simulation/Master/Master-Simulation.Rnw +++ /dev/null @@ -1,13 +0,0 @@ -% All pre-amble stuff should go into ../include/MainDocument.Rnw -\title{Simulation-Based Inference} -\author{Randall Pruim and Nicholas Horton and Daniel Kaplan} -\date{DRAFT: \today} -\Sexpr{set_parent('../../include/MainDocument.Rnw')} % All the latex pre-amble for the book -\maketitle - -\tableofcontents - -\newpage - -\import{../}{SimulationBased} - diff --git a/Starting/Cover/.DS_Store b/Starting/Cover/.DS_Store deleted file mode 100644 index 43087e1..0000000 Binary files a/Starting/Cover/.DS_Store and /dev/null differ diff --git a/Starting/Cover/.gitignore b/Starting/Cover/.gitignore new file mode 100644 index 0000000..e43b0f9 --- /dev/null +++ b/Starting/Cover/.gitignore @@ -0,0 +1 @@ +.DS_Store diff --git a/Starting/EarlyRExamples.Rnw b/Starting/EarlyRExamples.Rnw index bf4eb6c..69a20cb 100644 --- 
a/Starting/EarlyRExamples.Rnw +++ b/Starting/EarlyRExamples.Rnw @@ -1,6 +1,6 @@ <>= opts_chunk$set( fig.path="figures/EarlyR-" ) -set_parent("Master-Starting.Rnw") +set_parent("MOSAIC-StartTeaching.Rnw") set.seed(123) @ @@ -221,7 +221,7 @@ get quite a few correct -- maybe even all 10. But how likely is that? Let's try an experiment. I'll flip 10 coins. You guess which are heads and which are tails, and we'll see how you do. -\marginnote{Have each student make a guess by writing down a sequence +\TeachingTip{Have each student make a guess by writing down a sequence of 10 H's or T's while you flip the coin behind a barrier so that the students cannot see the results. } @@ -256,7 +256,7 @@ do the flipping for us. The \function{rflip()} function can flip one coin -\marginnote[3cm]{There is a subtle switch here. Before we were asking how +\Note[3cm]{There is a subtle switch here. Before we were asking how many of the students H's and T's matched the flipped coin. Now we are using H to simulate a correct guess and T to simulate an incorrect guess. This makes simulating easier.} @@ -274,13 +274,14 @@ rflip(10) Typing \code{rflip(10)} a bunch of times is almost as tedious as flipping all those coins. But it is not too hard to tell \R\ to \function{do()} this a bunch of times. -\marginnote{Notice that \function{do()} is clever about what information it records. Rather than recording all of the individual tosses, it is only recording the number of flips, the number of heads, and the number of tails.}% +\Note{Notice that \function{do()} is clever about what information it records. Rather than recording all of the individual tosses, it is only recording the number of flips, the number of heads, and the number of tails.}% <>= do(3) * rflip(10) @ +\newpage \noindent -Let's get \R\ to \function{do()} it for us 10,000 times and make a table of the results. +Now let's get \R\ to \function{do()} it for us 10,000 times and make a table of the results. 
<>= set.seed(123) @@ -292,26 +293,25 @@ perform 10,000 or more simulations live in class. For more complicated things (that might require fitting a model and extracting information from it at each iteration) you might prefer a smaller number for live demonstrations. -When you cover -inference for a proportion, it is a good idea to use those methods to revisit -the question of how many replications are required in that context.} +When you cover inference for a proportion, it is a good idea to use those methods to +revisit the question of how many replications are required in that context.} + + <>= # store the results of 10000 simulated ladies random.ladies <- do(10000) * rflip(10) @ <>= -options( width=60 ) - <>= tally(~heads, data=random.ladies) -# We can also display table using percentages +# We can also display a table using percentages tally(~heads, data=random.ladies, format="prop") @ We can display this table graphically using a plot called a \term{histogram} with bins of width~1. -\marginnote{The \pkg{mosaic} package adds some additional +\Note{The \pkg{mosaic} package adds some additional features to \function{histogram()}. In particular, the \option{width} and \option{center} arguments, which make it easier to control the bins, are only available if you are using the \pkg{mosaic} package.} @@ -354,16 +354,15 @@ design change things? We could simulate this by shuffling a deck of 10 cards and dealing five of them. 
-\begin{widestuff} +\Note{The use of \function{factor} here lets \R\ +know that the possible values are `M' and `T', even when only one +or the other appears in a given random sample.} <>= -cards <- factor(c("M","M","M","M","M","T","T","T","T","T")) +cards <- + factor(c("M","M","M","M","M","T","T","T","T","T")) tally(~deal(cards, 5)) @ -\end{widestuff} -\marginnote[2cm]{The use of \function{factor} here lets \R\ -know that the possible values are `M' and `T', even when only one -or the other appears in a given random sample.} <>= results <- do(10000) * tally(~deal(cards, 5)) tally(~ M, data=results) @@ -375,15 +374,17 @@ tally(~ M, data=results, format="perc") \label{sec:Births78Intro} The \dataframe{Births78} data set contains the number of births in the United States for each day of 1978. -\marginnote{The use of the phrase ``depends on'' is intentional. -Later we will emphasize how \texttt{\~} can often be interpreted as ``depends on''.} +\Note{The use of the phrase ``depends on'' is intentional. +Later we will emphasize how \code{y ~ x} can often be interpreted as +``\code{y} depends on \code{x}''.} A scatter plot of births by day of year reveals some interesting patterns. Let's see how the number of births depends on the day of the year. -\TeachingTip{The plot could also be made using \variable{date}. For general -purposes, this is probably the better plot to make, but using \variable{dayofyear} forces students to think more about what the x-axis means.} <>= xyplot(births ~ dayofyear, data=Births78) @ +\TeachingTip[-2cm]{The plot could also be made using \variable{date}. For general +purposes, this is probably the better plot to make, but using \variable{dayofyear} +forces students to think more about what the x-axis means.} When shown this image, students should readily be able to describe two patterns in the data; they should notice both the rise and fall over the course of the year and the two ``parallel waves". 
\TeachingTip{This can make a good ``think-pair-share'' activity. Have students come up with possible explanations, then discuss these explanations with a partner. Finally, have some of the pairs share their explanations with the entire class. } @@ -396,11 +397,13 @@ One conjecture about the parallel waves can be checked using the data at hand. I <>= trellis.par.set(superpose.symbol=list(pch=16, alpha=.6, cex=.6)) @ +\TeachingTip{The handful of exceptions are easier to see if we ``connect the dots''. +See Section~\ref{sec:births-lines}.} <>= require(mosaicData) # load mosaic data sets xyplot(births ~ dayofyear, data=Births78, - groups=dayofyear%%7, + groups=wday, auto.key=list(space="right")) @ @@ -423,20 +426,21 @@ xyplot(sat ~ expend, data=SAT) @ The implication, that spending less might give better results, is not justified. Expenditures are confounded with the proportion of students who take the exam, and scores are higher in states where fewer students take the exam. -<>= +<>= xyplot(expend ~ frac, data=SAT) xyplot(sat ~ frac, data=SAT) @ It is interesting to look at the original plot if we place the states into two groups depending on whether more or fewer than 40\% of students take the SAT: -<>= +<<>>= SAT <- mutate(SAT, fracGroup = derivedFactor( hi = (frac > 40), lo = (frac <=40) )) @ -<>= +<>= +xyplot(expend ~ frac, data=SAT) xyplot( sat ~ expend | fracGroup , data=SAT, type=c("p","r") ) xyplot( sat ~ expend, groups = fracGroup , data=SAT, @@ -470,27 +474,17 @@ Is there a relationship between infestation and Wilt disease? The accompanying table shows a cross tabulation of the number of plants that developed symptoms of Wilt disease. 
- -<>= -Mites <- data.frame( - mites = c(rep("Yes", 11), rep("No", 17), - rep("Yes", 15), rep("No", 4)), - wilt = c(rep("Yes", 28), rep("No", 19)) -) -@ - \newpage -\vspace*{-10mm} <>= -tally(~ wilt + mites, Mites) +tally(outcome ~ treatment, data = Mites, margins = TRUE) @ -\vspace*{-5mm} +\noindent Some questions for students: \begin{enumerate} \setlength\itemsep{1mm} - \item Here, what do you think is the explanatory variable? Response variable? + \item What do you think is the explanatory variable? Response variable? \item What proportion of the plants in the study with mites developed Wilt disease? \item What proportion of the plants in the study with no mites developed Wilt disease? \item Relative risk is the ratio of two risk proportions. What is the relative risk @@ -566,12 +560,13 @@ simulations very quickly. \begin{boxedText} \centerline{\textbf{Computational Simulation}} -<>= -tally(~ wilt + mites, data=Mites) -X <- tally(~ wilt + mites, data=Mites)["No","No"]; X +<>= +tally(outcome ~ treatment, data=Mites) +X <- tally(outcome ~ treatment, data=Mites)[1,1]; X nullDist <- do(1000) * - tally(~ wilt + shuffle(mites), data=Mites)["No","No"] -histogram(~ result, data=nullDist, width=1, type="density", fit="normal") + tally(outcome ~ shuffle(treatment), data=Mites)[1,1] +histogram(~ result, data=nullDist, width=1, + type="density", fit="normal", v=15) @ \end{boxedText} diff --git a/Starting/FrontMatter.Rnw b/Starting/FrontMatter.Rnw index 2a2e9d1..c0df27e 100644 --- a/Starting/FrontMatter.Rnw +++ b/Starting/FrontMatter.Rnw @@ -8,30 +8,38 @@ set.seed(123) \chapter*{About These Notes} -We present an approach to teaching introductory and intermediate -statistics courses that is tightly coupled with computing generally and with \R\ and \RStudio\ in particular. These activities and examples are intended to highlight a modern approach to statistical education that focuses on modeling, resampling based inference, and multivariate graphical techniques. 
A secondary goal is to -facilitate computing with data through use of small simulation studies %data scraping from the internet -and appropriate statistical analysis workflow. This follows the -philosophy outlined by Nolan and Temple Lang\cite{nola:temp:2010}. The importance of modern computation\marginnote{$\ $} in statistics education is a principal component of the recently adopted American Statistical Association's curriculum guidelines\cite{ASAcurriculum2014}. - -Throughout this book (and its companion volumes), we -introduce multiple activities, some -appropriate for an introductory course, others suitable for higher levels, that -demonstrate key concepts in statistics and modeling -while also supporting the core material of more traditional courses. +We present an approach to teaching introductory and intermediate statistics +courses that is tightly coupled with computing generally and with \R\ and +\RStudio\ in particular. These activities and examples are intended to +highlight a modern approach to statistical education that focuses on modeling, +resampling based inference, and multivariate graphical techniques. A secondary +goal is to facilitate computing with data through use of small simulation +studies and appropriate statistical analysis workflow. This follows the +philosophy outlined by Nolan and Temple Lang\cite{nola:temp:2010}. +The importance of modern computation\marginnote{$\ $} +in statistics education is a principal component of the recently adopted +American Statistical Association's curriculum guidelines\cite{ASAcurriculum2014}. + +Throughout this book (and its companion volumes), we introduce multiple +activities, some appropriate for an introductory course, others suitable for +higher levels, that demonstrate key concepts in statistics and modeling while +also supporting the core material of more traditional courses. 
\subsection*{A Work in Progress} \Caution{Despite our best efforts, you WILL find bugs both in this document and in our code. Please let us know when you encounter them so we can call in the exterminators.}% -These materials were developed for a workshop entitled +These materials were originally developed for a workshop entitled \emph{Teaching Statistics Using R} prior to the 2011 United States Conference -on Teaching Statistics and revised for USCOTS 2011, USCOTS 2013, eCOTS 2014, ICOTS 9, and USCOTS 2015. -We organized these workshops to help instructors integrate \R\ (as well as some related technologies) into statistics courses at all levels. -We received great feedback and many wonderful ideas from the participants and those that we've shared this with since the workshops. - -Consider these notes to be a work in progress. +on Teaching Statistics and revised for USCOTS 2011, USCOTS 2013, eCOTS 2014, ICOTS 9, +and USCOTS 2015. +We organized these workshops to help instructors integrate \R\ (as well as some +related technologies) into statistics courses at all levels. We received great +feedback and many wonderful ideas from the participants and those that we've +shared this with since the workshops. + +%Consider these notes to be a work in progress. %\SuggestionBox{Sometimes we will mark %places where we would especially like feedback with one of these suggestion boxes. %But we won't do that everywhere we want feedback or there won't be room for @@ -46,8 +54,9 @@ Updated versions will be posted at \url{http://mosaic-web.org}. \subsection*{Two Audiences} -The primary audience for these materials is instructors of statistics at the college or -university level. A secondary audience is the students these instructors teach. +We initially developed these materials for +instructors of statistics at the college or +university level. Another audience is the students these instructors teach. 
Some of the sections, examples, and exercises are written with one or the other of
these audiences more clearly at the forefront. This means that
\begin{enumerate}
\item
Some of the materials can be used essentially as is with students.
\item
Some of the materials aim to equip instructors to develop their own
expertise in \R\ and \RStudio\ to develop their own teaching materials.
\end{enumerate}
Although the distinction can get blurry, and what works ``as is" in one setting may
not work ``as is" in another, we'll try to indicate which parts
@@ -66,29 +75,11 @@ fit into each category as we go along.
Download and installation are quite straightforward for Mac, PC, or linux machines.
\RStudio\ is an integrated development environment (IDE) that facilitates use of \R\ for both novice and expert users. We have adopted it as our standard teaching environment because it dramatically simplifies the use of \R\ for instructors and for students.%
-\Pointer[-3cm]{Several things we use that can be done only in \RStudio, for instance \function{manipulate} or \RStudio's support for reproducible research).}%
+\Pointer[-3cm]{Several things we use can be done only in \RStudio\ (for instance, \function{manipulate} and \RStudio's integrated support for reproducible research).}%
%\RStudio\ is available from \url{http://www.rstudio.org/}.
-\RStudio\ can be installed as a desktop (laptop) application or as a server application that is accessible to users via the Internet.\TeachingTip[-.5cm]{RStudio server version works well with starting students. All they need is a web browser, avoiding any potential problems with oddities of students' individual computers.}
+\RStudio\ can be installed as a desktop (laptop) application or as a server application that is accessible to users via the Internet.\FoodForThought[-.5cm]{RStudio server version works well with starting students. All they need is a web browser, avoiding any potential problems with oddities of students' individual computers.}
In addition to \R\ and \RStudio, we will make use of several packages that need to be installed and loaded separately. The \pkg{mosaic} package (and its dependencies) will be used throughout. Other packages appear from time to time as well.
-\iffalse
-including
-\begin{multicols}{3}
-\begin{itemize}
-\item
-\pkg{fastR}
-\item
-\pkg{abd}
-%\item
-%\pkg{Zillow}
-\item
-\pkg{twitteR}
-\item
-\pkg{vcd}
-\end{itemize}
-\end{multicols}
-\authNote{Can we prune this list?}
-\fi
%\subsection*{Notation}
@@ -131,9 +122,20 @@ reproducible analysis methods. For beginners, we introduce
\pkg{knitr} with RMarkdown, which produces PDF, HTML, or Word files using a simpler syntax.}
-This document was created on
-\today, using \pkg{knitr} and
-\Sexpr{R.version.string}.
+\subsection*{Document Creation}
+
+This document was created on \today, using
+\begin{itemize}
+\item \pkg{knitr}, version \Sexpr{packageVersion("knitr")}
+\item \pkg{mosaic}, version \Sexpr{packageVersion("mosaic")}
+\item \pkg{mosaicData}, version \Sexpr{packageVersion("mosaicData")}
+\item \Sexpr{R.version.string}
+\end{itemize}
+
+Inevitably, each of these will be updated from time to time.
+If you find that things look different on your computer, make sure that your
+version of \R{} and your packages are up to date and check for a newer version
+of this document.
@@ -208,7 +210,8 @@ incremental modifications of existing resources that draw on the connections
between the MOSAIC topics.
\end{description}
-We welcome and encourage your participation in all of these initiatives.
+More details can be found at \url{http://www.mosaic-web.org}.
+We welcome and encourage your participation in all of these initiatives.
\chapter*{Computational Statistics}
@@ -224,8 +227,8 @@ tool to replace pencil-and-paper calculations and drawing plots manually.
In the second approach, more fundamental changes in the course result from the introduction of the computer. Some new topics are covered, some old topics are omitted. Some old topics are treated in very different ways, and perhaps at different points in the course. We will refer to this approach as \term{computational statistics} because the availability of computation is shaping how statistics is done and taught.
Computational statistics is a key component of \term{data science}, defined as the ability to use data to answer questions and communicate those results.
-\FoodForThought{Our students need to see aspects of computation and data science early and often
-to develop deeper skills. Establishing precursors in introductory courses will help them get started.}%
+\FoodForThought{Students need to see aspects of computation and data science early and often
+to develop deeper skills. Establishing precursors in introductory courses helps them get started.}%
In practice, most courses will incorporate elements of both statistical computation and computational statistics, but the relative proportions may differ dramatically from course to course.
@@ -251,16 +254,16 @@ At the same time, the development of \R\ and of \RStudio\ (an optional interface
and integrated development environment for \R) are making it easier and easier to get started with \R.
-Nevertheless, those who are unfamiliar with \R\ or who have never used \R\ for teaching are understandably cautious about using it with students. If you are in that category, then this book is for you. Our goal is to reveal some of what we have learned teaching with \R\ and to make teaching statistics with \R\ as rewarding and easy as possible -- for both students and faculty.
+%Nevertheless, those who are unfamiliar with \R\ or who have never used \R\ for teaching are understandably cautious about using it with students. If you are in that category, then this book is for you. Our goal is to reveal some of what we have learned teaching with \R\ and to make teaching statistics with \R\ as rewarding and easy as possible -- for both students and faculty.
We will cover both technical aspects of \R\ and \RStudio\ (e.g., how do I get \R\ to do thus and such?) as well as some perspectives on how to use computation to teach statistics. The latter will be illustrated in \R\ but would be equally applicable with other statistical software. + +%Others have used \R\ in their courses, but have perhaps left the course feeling +%like there must have been better ways to do this or that topic. If that +%sounds more like you, then this book is for you, too. As we have been working +%on this book, we have also been developing the \pkg{mosaic} -Others have used \R\ in their courses, but have perhaps left the course feeling -like there must have been better ways to do this or that topic. If that -sounds more like you, then this book is for you, too. As we have been working -on this book, we have also been developing the \pkg{mosaic} +\FoodForThought{Information about the \pkg{mosaic} package, including vignettes demonstrating features and supplementary materials (such as this book) can be found at \url{https://cran.r-project.org/web/packages/mosaic}.} +We developed the \pkg{mosaic} \R\ package (available on CRAN) to make certain aspects of statistical -computation and computational statistics simpler for beginners. -You will also find here some of our favorite activities, examples, and data -sets, as well as answers to questions that we have heard frequently from both students -and faculty colleagues. We invite you to scavenge from our materials and ideas -and modify them to fit your courses and your students. +computation and computational statistics simpler for beginners, without limiting their ability to +use more advanced features of the language. The \pkg{mosaic} package includes a modelling approach that uses the same general syntax to calculate descriptive statistics, create graphics, and fit linear models. 
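The unified modeling syntax just described can be seen in a short sketch. This example is ours, not part of the diff; it assumes the \pkg{mosaic} and \pkg{mosaicData} packages and uses the \dataframe{KidsFeet} data that appears later in these notes:

```r
# Same formula, three tasks (a sketch; assumes mosaic + mosaicData are installed)
library(mosaic)
library(mosaicData)

mean(length ~ sex, data = KidsFeet)    # grouped descriptive statistic
bwplot(length ~ sex, data = KidsFeet)  # lattice graphic, same formula
lm(length ~ sex, data = KidsFeet)      # linear model, same formula
```

The point of the design is that students learn the `goal(y ~ x, data = ...)` template once and reuse it everywhere.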
diff --git a/Starting/Master-Starting.Rnw b/Starting/MOSAIC-StartTeaching.Rnw similarity index 93% rename from Starting/Master-Starting.Rnw rename to Starting/MOSAIC-StartTeaching.Rnw index 0241c22..1e1b3a9 100644 --- a/Starting/Master-Starting.Rnw +++ b/Starting/MOSAIC-StartTeaching.Rnw @@ -1,11 +1,53 @@ \documentclass[openany]{tufte-book} +<>= +#setCacheDir("cache") +require(MASS) +require(grDevices) +require(datasets) +require(stats) +require(lattice) +require(grid) +# require(fastR) # commented out by NH on 7/12/2012 +require(mosaic) +require(mosaicData) +trellis.par.set(theme=col.mosaic(bw=FALSE)) +trellis.par.set(fontsize=list(text=9)) +options(format.R.blank=FALSE) +options(width=60) +options(digits=3) +require(vcd) +require(knitr) +opts_chunk$set( tidy=FALSE, + size='small', + dev="pdf", + fig.path="figures/fig-", + fig.width=3, fig.height=2, + fig.align="center", + fig.show="hold", + comment=NA) +knit_theme$set("greyscale0") +@ + +<>= +includeChapter <- FALSE # don't show chapters not yet being paginated. +showEdited <- FALSE # displaying chapters already paginated for printing +knit_hooks$set(document = function(x) { + sub('\\usepackage[]{color}', '\\usepackage[]{xcolor}', + x, fixed = TRUE) +}) + +#knit_hooks$set(document = function(x) { +# gsub('(\\\\end\\{knitrout\\})\n', '\\1', x) +#}) +@ -\usepackage{RBook} + +\usepackage{../include/RBook} \usepackage{pdfpages} %\usepackage[shownotes]{authNote} \usepackage[hidenotes]{authNote} -\usepackage{language} +\usepackage{language} % available at https://github.com/rpruim/latex \usepackage{hyperref} \usepackage{fancyhdr} % DTK added for header. @@ -83,7 +125,7 @@ \title{Start Teaching with R} -\author[Pruim, Horton & Kaplan]{Randall Pruim, Nicholas J. Horton, and Daniel Kaplan} +\author[Pruim, Horton \& Kaplan]{Randall Pruim, Nicholas J. Horton, and Daniel T. Kaplan} \date{January 2015} \begin{document} @@ -94,47 +136,6 @@ % a blank line following an R chunk. 
\renewenvironment{knitrout}{}{\noindent\ignorespaces\!\!} -<>= -#setCacheDir("cache") -require(MASS) -require(grDevices) -require(datasets) -require(stats) -require(lattice) -require(grid) -# require(fastR) # commented out by NH on 7/12/2012 -require(mosaic) -require(mosaicData) -trellis.par.set(theme=col.mosaic(bw=FALSE)) -trellis.par.set(fontsize=list(text=9)) -options(format.R.blank=FALSE) -options(width=70) -require(vcd) -require(knitr) -opts_chunk$set( tidy=FALSE, - size='small', - dev="pdf", - fig.path="figures/fig-", - fig.width=3, fig.height=2, - fig.align="center", - fig.show="hold", - comment=NA) -knit_theme$set("greyscale0") -@ - -<>= -includeChapter <- FALSE # don't show chapters not yet being paginated. -showEdited <- FALSE # displaying chapters already paginated for printing -knit_hooks$set(document = function(x) { - sub('\\usepackage[]{color}', '\\usepackage[]{xcolor}', - x, fixed = TRUE) -}) - -#knit_hooks$set(document = function(x) { -# gsub('(\\\\end\\{knitrout\\})\n', '\\1', x) -#}) -@ - %\maketitle \includepdf{frontice} @@ -142,10 +143,10 @@ knit_hooks$set(document = function(x) { \newpage \vspace*{2in} -\parbox{4in}{\noindent Copyright (c) 2015 by Randall Pruim, Nicholas Horton, \& Daniel Kaplan.} +\parbox{4in}{\noindent Copyright (c) 2015 by Randall Pruim, Nicholas J. Horton, \& Daniel T. 
Kaplan.} \medskip -\parbox{4in}{\noindent Edition 1.0, January 2015} +\parbox{4in}{\noindent Edition 1.1, November 2015} \bigskip @@ -179,6 +180,9 @@ knit_hooks$set(document = function(x) { <>= @ +<>= +@ + <>= @ @@ -188,7 +192,7 @@ knit_hooks$set(document = function(x) { <>= @ -%\backmatter +\backmatter \bibliographystyle{alpha} \bibliography{../include/USCOTS} diff --git a/Starting/MOSAIC-StartTeaching.pdf b/Starting/MOSAIC-StartTeaching.pdf new file mode 100644 index 0000000..410a52e Binary files /dev/null and b/Starting/MOSAIC-StartTeaching.pdf differ diff --git a/Starting/Master/.gitignore b/Starting/Master/.gitignore deleted file mode 100644 index 63cd05f..0000000 --- a/Starting/Master/.gitignore +++ /dev/null @@ -1,12 +0,0 @@ -Master-Starting-concordance.tex -*.log -*.pdf -*.synctex.gz -*.tex -*.toc -*.aux -*.bbl -*.blg -Rindex.idx -framed.sty -mainIndex.idx diff --git a/Starting/Master/Master-Starting.Rnw b/Starting/Master/Master-Starting.Rnw deleted file mode 100644 index c2b28be..0000000 --- a/Starting/Master/Master-Starting.Rnw +++ /dev/null @@ -1,55 +0,0 @@ - - -\documentclass[open-any,12pt]{tufte-book} -\usepackage{../../include/RBook} -\title{Start Teaching with R} -\author{Randall Pruim and Nicholas Horton and Daniel Kaplan} -\date{DRAFT: \today} - -<>= -..makingMaster.. 
<- TRUE -@ - -\maketitle - -\tableofcontents - -\newpage - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - -<>= -@ - - -% \chapter{The second chapter} -% -% \import{../}{example-file} diff --git a/Starting/RForInstructors.Rnw b/Starting/RForInstructors.Rnw index e414752..440d138 100644 --- a/Starting/RForInstructors.Rnw +++ b/Starting/RForInstructors.Rnw @@ -1,6 +1,6 @@ <>= -opts_chunk$set( fig.path="figures/RForInstructors-", tidy=FALSE ) -set_parent('Master-Starting.Rnw') +opts_chunk$set(fig.path="figures/RForInstructors-", tidy=FALSE) +set_parent("MOSAIC-StartTeaching.Rnw") set.seed(123) require(fastR) @ @@ -33,7 +33,7 @@ Our workflow advice can be summarized in one short sentence: \BlankNote{We don't really think of our classroom use of \R\ as programming since we use \R\ in a mostly declarative rather than algorithmic way.}% -% + It doesn't take sophisticated programming skills to be good at using \R. In fact, most uses of \R\ for teaching statistics can be done working one step at a time, where each line of code does one complete and useful task. After inspecting the output @@ -52,14 +52,14 @@ less error-prone. Get in the habit (and get your students in the habit) of working with \R\ scripts and especially RMarkdown files. -You can execute all the code in an \R\ script file using -\Pointer[-2cm]{\R\ can be used to create executable scripts. Option parsing and handling is supported with the \pkg{optparse} package.} +\Pointer[0cm]{\R\ can be used to create executable scripts. Option parsing and handling is supported with the \pkg{optparse} package.} +You can execute all the code in an \R\ script file using \Rindex{source()} <>= source("file.R") @ - +\noindent \Rstudio\ has additional options for executing some or all lines in a file. See the buttons in the tab for any \R\ script, RMarkdown or Rnw file. (You can create a new file in the main \tab{File} menu.) 
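The margin note above mentions the \pkg{optparse} package for option handling in executable \R\ scripts. Here is a minimal sketch; the `--n` flag and the script's task are our own invented example, not from the text:

```r
#!/usr/bin/env Rscript
# A minimal executable R script with a command-line option (a sketch;
# assumes the optparse package is installed)
library(optparse)

parser <- OptionParser(option_list = list(
  make_option("--n", type = "integer", default = 10,
              help = "number of random draws [default %default]")
))
opts <- parse_args(parser)

set.seed(1)
cat("mean of", opts$n, "draws:", mean(rnorm(opts$n)), "\n")
```

Saved as, say, `draws.R` and marked executable, it could be run from a shell as `./draws.R --n 100`.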
@@ -79,8 +79,10 @@ you can selectively copy portions of your history to a script file
Rarely should objects be named with a single letter.
- Adopt a personal convention regarding case of letters. This will mean you have one less thing to remember when trying to recall the name of an object. For
- example, in the \pkg{mosaic} package, all data frames begin with a
+ Adopt a personal convention regarding case of letters.
+ This will mean you have one less thing to remember when trying to recall
+ the name of an object.
+ For example, in the \pkg{mosaicData} package, all data frames begin with a
capital letter. Most variables begin with a lower case letter (a few
exceptions are made for some variables with names that are well-known in their
capitalized form).
@@ -88,17 +90,18 @@ you can selectively copy portions of your history to a script file
\item
Adopt reusable idioms.
- Computer programmers refer to the little patterns that recur throughout
- their code as idioms. For example, here is a ``compute, save, display''
+\enlargethispage{1in}
+
+ Computer programmers refer to the little patterns that recur throughout their code as idioms. For example, here is a ``compute, save, display''
idiom.
<>=
# compute, save, display idiom
-footModel <- lm( length ~ width, data=KidsFeet ); footModel
+footModel <- lm(length ~ width, data=KidsFeet); footModel
@
<>=
# alternative that reflects the order of operations
-lm( length ~ width, data=KidsFeet ) -> footModel; footModel
+lm(length ~ width, data=KidsFeet) -> footModel; footModel
@
Often there are multiple ways to do the same thing in \R,
@@ -148,9 +151,13 @@ class(KidsFeet$length)
class(KidsFeet$sex)
str(KidsFeet) # show the class for each variable
@
-\end{widestuff}
\Rindex{KidsFeet}
+\Pointer{One difference between a factor and a character is
+that a factor knows the possible values, even if some of them
Sometimes this is an advantage (tallying empty +cells in a table) and sometimes it is a disadvantage (when factors +are used as unique identifiers).}% From this we see that \dataframe{KidsFeet} is a data frame and that the variables are of different types (integer, numeric, and factor). These are the kinds of variables you are most likely to encounter, although @@ -159,11 +166,6 @@ as well. Factors are the most common way for categorical data to be stored in \R, but sometimes the character class is better. -\Pointer{One difference between a factor and a character is -that a factor knows the possible values, even if some them -do not occur. Sometimes this is an advantage (tallying empty -cells in a table) and sometimes it is a disadvantage (when factors -are used as unique identifiers).}% The class of an object determines what things can be done with it and how it appears when printed, plotted, or displayed in the console. @@ -177,7 +179,6 @@ integer but a collection of integers. So we can think of \variable{birthmonth} There is more than one kind of container in \R. The containers used for variables in a data frame are called \term{vectors}. \myindex{vector}% The items in a vector are ordered (starting with 1) and must all be of the same type. -\DiggingDeeper[-1cm]{In fact, they must all be of the same \emph{atomic} type. Atomic types are are the basic building blocks for \R. It is not possible to store more complicated objects (like data frames) in a vector.}% Vectors can be created using the \function{c()} function: \Rindex{c()} @@ -206,7 +207,7 @@ z <- c(1, TRUE, 1.2, "vector"); z # all converted to character class(z) @ -\DiggingDeeper{A factor can be ordered or unordered (which can affect how statistics tests are performed but otherwise does not matter much). The default is for factors to be unordered. Whether the factors are ordered or unordered, thelevels will appear in a fixed order -- alphabetical by default. 
The distinction between ordered and unordered factors has to do with whether this order is meaningful or arbitrary.}% +\DiggingDeeper{A factor can be ordered or unordered (which can affect how statistics tests are performed but otherwise does not matter much). The default is for factors to be unordered. Whether the factors are ordered or unordered, the levels will appear in a fixed order -- alphabetical by default. The distinction between ordered and unordered factors has to do with whether this order is meaningful or arbitrary.}% % Factors can be created by wrapping a vector with \function{factor()}: \Rindex{factor()} @@ -238,16 +239,17 @@ square bracket operator: w[1] x[2] y[3] -z[5] # this is not an error, but returns NA (missing) @ + Missing values are coded as \code{NA} (not available). Asking for an entry ``off the end'' of a vector returns \code{NA}. Assigning a value ``off the end'' of a vector results in the vector being lengthened so that the new value can be stored in the appropriate location. <<>>= +z[5] # this is not an error, but returns NA (missing) q <- 1:5 q -q[10] <- 10 +q[10] <- 10 # elements 6 thru 9 will be filled with NA q @ @@ -273,9 +275,16 @@ y [ y > 20 ] # select the items greater than 20 @ The last item deserves a bit of comment. The expression inside the brackets evaluates to a vector of logical values. +<>= +oldparams <- options() +options(width = 90) +@ <<>>= y > 20 @ +<>= +options(oldparams) +@ The logical values are then used to select (true) or deselect (false) the items in the vector, producing a new (and potentially shorter) vector. If the number of logical supplied is less than the length of the @@ -313,17 +322,25 @@ ncol(KidsFeet) @ \myindex{list} +\Pointer{In official \R{} parlance, the distinction we make between vectors and lists +is really the distinction between \emph{atomic} vectors and lists (which are also called +\emph{generic} vectors). +In fact, they must all be of the same \emph{atomic} type. 
+Atomic vectors are the basic building blocks for \R.
+It is not possible to store more complicated objects (like data frames) in a
+vector, but they can be stored in a list.}%
+
Another commonly used container in \R\ is a list. We have already seen a few
examples of lists used as arguments to \pkg{lattice} plotting functions.
-Lists are also ordered, but the items in a list can be objects of any type (they
-need not all be the same type).
+Lists are also ordered, but the items in a list can be objects of any type, and they
+need not all be the same type.
Behind the scenes, a data frame is a list of vectors with the restriction that each
vector must have the same \rterm{length} (contain the same number of items).
\Rindex{length()}
Lists can be created using the \function{list()} function.
<<>>=
-l <- list( 1, "two", 3.2, list(1, 2)); l
+l <- list(1, "two", 3.2, list(1, 2)); l
length(l) # Note: l has 4 elements, not 5
@
Items in a list can be accessed with the double square bracket (\code{[[ ]]}).
@@ -348,7 +365,7 @@ be accessed by name as well as by position.
x <- c(one=1, two=2, three=3); x
y <- list(a=1, b=2, c=3); y
x["one"]
-y["a"]
+y[["a"]] # retrieve items from a list with [[ ]]
names(x)
names(x) <- c("A", "B", "C"); x
@
@@ -416,6 +433,8 @@ log(x) # natural log
log10(x) # base 10 log
@
+\enlargethispage{1in}
+
\noindent
Vectors can be combined into a matrix using \function{rbind()} or \function{cbind()}.
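The last sentence above can be illustrated with a tiny sketch (our own toy vectors, not from the text):

```r
# Combining vectors into a matrix by rows or by columns (a minimal sketch)
x <- 1:3
y <- 4:6
rbind(x, y)  # a 2 x 3 matrix with x and y as rows
cbind(x, y)  # a 3 x 2 matrix with x and y as columns
```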
+Whether a function is vectorized or treats a vector as a unit depends on its implementation. +Usually, things are implemented the way you would expect. +\Pointer{The \function{Vectorize()} function is a useful tool for converting +a non-vectorized function into a vectorized function.}% +Occasionally you may discover a function that you wish were vectorized and is not. +When writing your own functions, give some thought to whether they should be vectorized, +and test them with vectors of length greater than 1 to make sure you get the intended behavior. \Rindex{sum()}% \Rindex{prod()}% \Rindex{cumsum()}% @@ -532,7 +557,7 @@ The operations listed below can be helpful when writing your own functions. & Returns a \verb!logical! indicating whether any elements of \verb!x! are true. - Typical use: \verb!if ( any(y > 5) ) { ...}!. + Typical use: \verb!if (any(y > 5)) { ...}!. \\ \hline \verb!na.omit(x)! & Returns a vector with missing values removed. \\ \hline @@ -722,11 +747,11 @@ resample(1:6, size=20) @ \Rindex{Cards} \Rindex{deal()} -For working with cards, the \pkg{mosaic} package provides a vector named \variable{Cards} +For working with cards, the \pkg{mosaicData} package provides a vector named \variable{Cards} and \function{deal()} as an alternative name for \function{sample()}. <<>>= -deal( Cards, 5 ) # poker hand -deal( Cards, 13 ) # bridge, anyone? +deal(Cards, 5) # poker hand +deal(Cards, 13) # bridge, anyone? @ If you want to sort the hands nicely, you can create a factor from \variable{Cards} first: @@ -735,7 +760,7 @@ first: \begin{widestuff} <<>>= -hand <- deal( factor(Cards, levels=Cards), 13 ) +hand <- deal(factor(Cards, levels=Cards), 13) sort(hand) # sorted by suit, then by denomination @ \end{widestuff} @@ -804,7 +829,7 @@ sample from the desired distribution and make a histogram of the resulting sample. 
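The \function{Vectorize()} margin note above can be illustrated with a toy sketch; the function `f` is our own invented example:

```r
# A function that treats its arguments as units rather than element-wise
f <- function(x, n) paste(rep(x, n), collapse = "")
f("a", 3)                # "aaa" -- fine for scalar arguments

# Vectorize() wraps f so it is applied to corresponding pairs of arguments
F <- Vectorize(f)
F(c("a", "b"), c(2, 3))  # one result per (x, n) pair
```

Without the wrapper, `f(c("a", "b"), c(2, 3))` would collapse everything into a single string, which is probably not what a user expecting vectorized behavior intends.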
<<>>= x1 <- rnorm(500, mean=10, sd=2) -histogram(~x1, width=.5) +histogram( ~ x1, width=.5) @ This works, but the resulting plot has a fair amount of noise. @@ -812,8 +837,8 @@ The \function{ppoints()} function returns evenly spaced probabilities and allows us to obtain theoretical quantiles of the normal distribution instead. The resulting plot now illustrates the idealized sample from a normal distribution. <<>>= -x2 <- qnorm( ppoints(500), mean=10, sd=2 ) -histogram(~x2, width=.5) +x2 <- qnorm(ppoints(500), mean=10, sd=2) +histogram( ~ x2, width=.5) @ This is not what real data will look like (even if it comes from a normal population), but it can be better for illustrative purposes to remove the noise. @@ -837,7 +862,7 @@ write.csv(ddd, "ddd.csv") Data can also be saved in native \R\ format. Saving data sets (and other \R\ objects) using \function{save()} has some advantages over other file formats: -\Pointer[-2cm]{If you want to save an \R\ object but not its name, you can use \function{saveRDS()} and choose its name when you read it with \function{readRDS()}.} +\Pointer[-1cm]{If you want to save an \R\ object but not its name, you can use \function{saveRDS()} and choose its name when you read it with \function{readRDS()}.} \begin{itemize} \item Complete information about the objects is saved, including attributes. @@ -894,31 +919,26 @@ can be used to combine data from multiple data frames. \subsection{Adding new variables to a data frame} The \function{mutate()} function can be used to add or modify variables in a data frame. 
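As a minimal sketch of what \function{mutate()} does (our own example; assumes \pkg{mosaic} and \pkg{mosaicData} are loaded, and the `ratio` variable is invented for illustration):

```r
# Add a new variable computed from existing ones (a sketch)
library(mosaic)
library(mosaicData)

KidsFeet2 <- mutate(KidsFeet, ratio = width / length)
head(KidsFeet2, 2)  # the new ratio column appears alongside the originals
```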
-\Note{\function{mutate()} is evaluated in such a way that you have direct
-access to the other variables in the data frame, including one created earlier
+\Note{\function{mutate()} has direct access to the other variables in the
+data frame, including any created earlier
in the same \function{mutate()} command.}
Here we show how to modify the \dataframe{Births78} data frame so
-that it contains a new variable \variable{day} that is an ordered factor.
-%(Details about some of the functions involved will be presented later
-%in this chapter).
-\Pointer[3.5cm]{The \pkg{lubridate} package provides a
-\function{wday()} function that can do this more simply
-and directly from the \variable{date} variable as well as
-a number of utilities for creating and manipulating date and time objects.}
+that it contains a new variable \variable{weekend} that distinguishes
+between weekdays and weekends.
<>=
data(Births78)
weekdays <- c("Sun", "Mon", "Tue", "Wed", "Thr", "Fri", "Sat")
-Births <- mutate( Births78,
- day = factor(weekdays[1 + (dayofyear - 1) %% 7],
- ordered=TRUE, levels = weekdays) )
-head(Births,3)
+Births <-
+  Births78 %>%
+  mutate(weekend = wday %in% c("Sat", "Sun"))
+
+head(Births, 3)
@
<>=
-xyplot( births ~ date, Births, groups=day, auto.key=list(space='right') )
+xyplot(births ~ date, Births, groups=weekend, auto.key=list(space='right'))
@
\marginnote{Number of US births in 1978 colored by day of week.}
@@ -942,24 +962,24 @@ since completing their education and that their age at
graduation is 6 more than the number of years of education obtained.
<<>>=
CPS85 <- mutate(CPS85, workforce.years = age - 6 - educ)
-favstats(~workforce.years, data=CPS85)
+favstats( ~ workforce.years, data=CPS85)
@
In fact this is what was done for all but one of the cases to create the
\variable{exper} variable that is already in the \dataframe{CPS85} data.
<<>>= -tally(~ (exper - workforce.years), data=CPS85) +tally( ~ (exper - workforce.years), data=CPS85) @ With categorical variables, sometimes we want to modify the coding scheme. <>= -HELP2 <- mutate( HELPrct, - newsex = factor(female, labels=c('M','F')) ) +HELP2 <- mutate(HELPrct, + newsex = factor(female, labels=c('M','F'))) @ It's a good idea to do some sort of sanity check to make sure that the recoding worked the way you intended <<>>= -tally( ~ newsex + female, data=HELP2 ) +tally( ~ newsex + female, data=HELP2) @ The \function{derivedFactor()} function can simplify creating factors based @@ -1038,7 +1058,7 @@ of variables to keep or discard. \Rindex{number_range()}% <<>>= -head( select(HELPrct, contains("risk")), 2 ) +head(select(HELPrct, contains("risk")), 2) @ The nested functions in the previous command make the code a bit hard to read, and things @@ -1056,9 +1076,9 @@ explicitly pass along outputs of one function as an argument to the next. Here are a few more examples: <<>>= -HELPrct %>% select( ends_with("e")) %>% head(2) -HELPrct %>% select( starts_with("h")) %>% head(2) -HELPrct %>% select( matches("i[12]")) %>% head(2) # regex matching +HELPrct %>% select(ends_with("e")) %>% head(2) +HELPrct %>% select(starts_with("h")) %>% head(2) +HELPrct %>% select(matches("i[12]")) %>% head(2) # regex matching @ \subsection{Renaming variables} @@ -1135,7 +1155,7 @@ only certain rows from a data frame. 
<>= # any logical can be used to create subsets faithful2 %>% filter(duration > 3) -> faithfulLong -xyplot( time_til_next ~ duration, faithfulLong ) +xyplot(time_til_next ~ duration, faithfulLong) @ \end{center} @@ -1144,10 +1164,10 @@ xyplot( time_til_next ~ duration, faithfulLong ) If all we want to do is produce a graph and don't need to save the subset, the plot above could also be made with one of the following <>= -xyplot( time_til_next ~ duration, - data = faithful2 %>% filter( duration > 3) ) -xyplot( time_til_next ~ duration, data = faithful2, - subset=duration > 3 ) +xyplot(time_til_next ~ duration, + data = faithful2 %>% filter(duration > 3)) +xyplot(time_til_next ~ duration, data = faithful2, + subset=duration > 3) @ \subsection{Summarising a data frame} @@ -1174,27 +1194,27 @@ package are probably easier for this particular task, but using \pkg{dplyr} is m OLD <- options(width=110) @ <>= -favstats( age ~ sex + substance, data=HELPrct, .format="table" ) +favstats(age ~ sex + substance, data=HELPrct, .format="table") @ <>= -favstats( age ~ sex + substance, data=HELPrct) %>% data.frame +favstats(age ~ sex + substance, data=HELPrct) %>% data.frame @ <>= -mean( age ~ sex + substance, data=HELPrct, .format="table" ) +mean(age ~ sex + substance, data=HELPrct, .format="table") @ <>= -mean( age ~ sex + substance, data=HELPrct) -> foo -foo <- data.frame( group=names(foo), mean=foo ) +mean(age ~ sex + substance, data=HELPrct) -> foo +foo <- data.frame(group=names(foo), mean=foo) row.names(foo) <- NULL foo @ <>= -sd( age ~ sex + substance, data=HELPrct, .format="table" ) +sd(age ~ sex + substance, data=HELPrct, .format="table") @ <>= -sd( age ~ sex + substance, data=HELPrct) -> foo -foo <- data.frame( group=names(foo), sd=foo ) +sd(age ~ sex + substance, data=HELPrct) -> foo +foo <- data.frame(group=names(foo), sd=foo) row.names(foo) <- NULL foo @ @@ -1218,7 +1238,7 @@ HELPrct %>% arrange(x.bar) @ - +\enlargethispage{1in} \subsection{Merging datasets} The 
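As the text notes, \pkg{dplyr} offers a more general route to grouped summaries than the table-shaped \pkg{mosaic} output above. A sketch of the equivalent \function{group\_by()}/\function{summarise()} idiom (our own example, using \dataframe{HELPrct}):

```r
# Grouped summaries with dplyr (a sketch; assumes mosaicData for HELPrct)
library(dplyr)
library(mosaicData)

HELPrct %>%
  group_by(sex, substance) %>%
  summarise(mean_age = mean(age), sd_age = sd(age))
```

The result is itself a data frame, which makes it easy to feed into further processing or plotting.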
\dataframe{fusion1} data frame in the \pkg{fastR} package contains genotype information for a SNP (single nucleotide polymorphism) in the gene \emph{TCF7L2}. The \dataframe{pheno} data frame contains phenotypes (including type 2 diabetes case/control status) for an intersecting set of individuals. We can merge these together to explore the association between genotypes and phenotypes using one of the join functions in \pkg{dplyr} or using the \function{merge()} function. @@ -1230,8 +1250,8 @@ OLD <- options(width=90) @ <<>>= require(fastR) -head(fusion1,3) -head(pheno,3) +fusion1 %>% head(3) +pheno %>% head(3) @ \end{widestuff} @@ -1240,7 +1260,7 @@ head(pheno,3) # merge fusion1 and pheno keeping only id's that are in both fusion1m <- merge(fusion1, pheno, by.x='id', by.y='id', all.x=FALSE, all.y=FALSE) -head(fusion1m, 3) +fusion1m %>% head(3) @ <>= options(OLD) @@ -1248,24 +1268,25 @@ options(OLD) \end{widestuff} <<>>= -left_join( pheno, fusion1, by="id") %>% dim() +pheno %>% left_join(fusion1, by="id") %>% dim() @ <<>>= -inner_join( pheno, fusion1, by="id") %>% dim() +pheno %>% inner_join(fusion1, by="id") %>% dim() @ <<>>= # which ids are only in \dataframe{pheno}? setdiff(pheno$id, fusion1$id) +pheno %>% anti_join(fusion1, by="id") @ -The difference between an inner join and a left join is that the inner join only includes rows from the first data frame that have a match in the second but aleft join includes all rows of the first data frame, even if they do not have a match in the second. In the example above, there are two subjects in \dataframe{pheno} that do not appear in \dataframe{fusion1}. +The difference between an inner join and a left join is that the inner join only includes rows from the first data frame that have a match in the second but a left join includes all rows of the first data frame, even if they do not have a match in the second. In the example above, there are two subjects in \dataframe{pheno} that do not appear in \dataframe{fusion1}. 
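+The join semantics are easy to see with two tiny data frames. (This is a toy
+illustration; \code{A} and \code{B} are our own throwaway objects, not part of
+the \pkg{fastR} data.)
+<<>>=
+A <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
+B <- data.frame(id = c(2, 3, 4), y = c("p", "q", "r"))
+A %>% inner_join(B, by = "id")   # only ids 2 and 3 appear in both
+A %>% left_join(B, by = "id")    # all of A; y is NA where there is no match
+A %>% anti_join(B, by = "id")    # rows of A with no match in B
+@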
\function{merge()} handles these distinctions with the \option{all.x} and \option{all.y} arguments. In this case, since the values are the same for each data frame, we could collapse \option{by.x} and \option{by.y} to \option{by} and collapse \option{all.x} and \option{all.y} to \option{all}. The first of these specifies which column(s) to use to identify matching cases. The second indicates whether cases in one data frame that do not appear in the other should be kept (\code{TRUE}, filling in \code{NA} as needed) or dropped (\code{FALSE}) from the merged data frame. Now we are ready to begin our analysis. <>= -tally(~t2d + genotype + marker, data=fusion1m) +tally( ~ t2d + genotype + marker, data=fusion1m) @ \begin{problem} @@ -1294,15 +1315,19 @@ have in your final data frame. \Rindex{RMySQL}% \myindex{SQL}% -The \pkg{RMySQL} package allows direct access to data in MySQL data bases and the \pkg{dplyr} package facilitates processing this data in the same way as for data in a data frame. This makes it easy to work with very large data sets stored in public databases. The example below queries the UCSC\marginnote{UCSC --- Univ. of California, Santa Cruz}% +The \pkg{RMySQL} package allows direct access to data in MySQL databases, and the \pkg{dplyr} package facilitates processing these data in the same way as data in a data frame. This makes it easy to work with very large data sets stored in public databases. The example below queries the UCSC +\BlankNote{UCSC --- Univ. of California, Santa Cruz}% genome browser to find all the known genes on chromosome~1.
\begin{widestuff} -<>= -OLD <- options( width=100 ) +<>= +library(RMySQL) +OLD <- options(width=100) @ + <>= # connect to a UCSC database +library(RMySQL) UCSCdata <- src_mysql( host="genome-mysql.cse.ucsc.edu", user="genome", @@ -1313,8 +1338,8 @@ KnownGene <- tbl(UCSCdata, "knownGene") # Get the gene name, chromosome, start and end sites for genes on Chromosome 1 Chrom1 <- KnownGene %>% - select( name, chrom, txStart, txEnd ) %>% - filter( chrom == "chr1" ) + select(name, chrom, txStart, txEnd) %>% + filter(chrom == "chr1") @ <>= options(OLD) @@ -1329,20 +1354,20 @@ class(Chrom1) @ \Rindex{mutate()} \Caution[3cm]{The arithmetic operations in this \function{mutate()} command are being executed in SQL, not in \R, and the palette of allowable functions is much smaller. It is not possible, for example, to compute the logarithm of the length here using \function{log()}. For that we must first collect the data into a real data frame.} -<<>>= +<>= Chrom1 %>% mutate(length=(txEnd - txStart)/1000) -> Chrom1l Chrom1l @ -For efficiency, the full data are not pulled from the database until needed (or until we request this using \function{collect()}). This allows us, for example, to inspect the firstfew rows of a potentially large pull from the database without actually having done all ofthe work required to pull that data. +For efficiency, the full data are not pulled from the database until needed (or until we request this using \function{collect()}). This allows us, for example, to inspect the first few rows of a potentially large pull from the database without actually having done all of the work required to pull that data. But certain things do not work unless we collect the results from the database into an actual data frame. To plot the data using \pkg{lattice} or \pkg{ggplot2}, for example, we must first \function{collect()} it into a data frame.
\Rindex{collect()} -<>= +<>= Chrom1df <- collect(Chrom1l) # collect into a data frame -histogram( ~length, data=Chrom1df, xlab="gene length (kb)" ) +histogram( ~ length, data=Chrom1df, xlab="gene length (kb)") @ @@ -1351,19 +1376,15 @@ histogram( ~length, data=Chrom1df, xlab="gene length (kb)" ) %There is an \href{http://csg.sph.umich.edu/docs/R/rsql.html}{online document} %describing this type of manipulation. -\section{Reshaping data} -\authNote{NH to expand} -\authNote{Hadley is working on a new package for tidying data that will replace this.} +\section{Reshaping data with \texttt{tidyr}} - -\function{reshape()} provides a flexible way to change the arrangement of data. -\Rindex{reshape()}% -It was designed for converting between long and wide versions of -time series data and its arguments are named with that in mind. - -A common situation is when we want to convert from a wide form to a -long form because of a change in perspective about what a unit of -observation is. For example, in the \dfn{traffic} data frame, each +\Rindex{tidyr} +Sometimes data come in a shape that doesn't suit our purposes. The \pkg{tidyr} +package includes several functions for tidying data, including +\function{spread()} and \function{gather()}, which can be used to convert +between ``long" and ``wide" formats. +We may want to do this because of a change in perspective about what a unit of +observation is. For example, in the \dataframe{traffic} data frame, each row is a year, and data for multiple states are provided. <>= traffic @ We can reformat this so that each row contains a measurement for a -single state in one year. +single state in one year by gathering the state columns.
<>= -longTraffic <- - reshape(traffic[,-2], idvar="year", ids=row.names(traffic), - times=names(traffic)[3:6], timevar="state", - varying=list(names(traffic)[3:6]), v.names="deathRate", - direction="long") -head(longTraffic) +require(tidyr) +LongTraffic <- + traffic %>% + select(-cn.deaths) %>% + gather(state, death.rate, ny:ri) +head(LongTraffic) +@ + +This long format allows us to create a plot like this. +<<>>= +xyplot(death.rate ~ year, data = LongTraffic, groups = toupper(state), + type = "l", + auto.key = list(space = "right", lines = TRUE, points = FALSE)) @ We can also reformat the other way, this time having all data for a given state form a row in the data frame. @@ -1389,20 +1417,24 @@ We can also reformat the other way, this time having all data for a given state OLD <- options(width=100) @ <>= -stateTraffic <- reshape(longTraffic, direction='wide', v.names="deathRate", - idvar="state", timevar="year") -stateTraffic +StateTraffic <- + LongTraffic %>% + spread(state, death.rate) +StateTraffic %>% head(3) @ +\noindent +We can create a plot using data in this format as well, but it involves a type of formula +we have not seen before: +<<>>= +xyplot(ri + ny + cn + ma ~ year, data=StateTraffic, type = "l", + auto.key = list(space = "right", lines = TRUE, points = FALSE)) +@ + <>= options(OLD) @ \end{widestuff} -In simpler cases, \function{stack()} or \function{unstack()} may suffice. -\verb!Hmisc! also provides \verb!reShape()! as an alternative -to \verb!reshape()!. -\Rindex{stack()}% -\Rindex{unstack()}% %\subsection{Simple Relational Database Operations} @@ -1489,7 +1521,7 @@ mystats((1:20)^2) \marginnote[-2cm]{There are ways to check the \rterm{class} of an argument to see if it is a data frame, a vector, numeric, etc. 
A really robust function should check to make sure that the values supplied to the arguments are of appropriate types.}% -The first line says that we are defining a function called \function{mystats()} with one argument, named \variable{x}. The lines surrounded by curly braces give the code to be executed when the function is called. So our function computesthe mean, then the median, then the standard deviation of its argument. +The first line says that we are defining a function called \function{mystats()} with one argument, named \variable{x}. The lines surrounded by curly braces give the code to be executed when the function is called. So our function computes the mean, then the median, then the standard deviation of its argument. But as you see, this doesn't do exactly what we wanted. So what's going on? The value returned by the last line of a function is (by default) returned by the function to its calling environment, where it is (by default) printed to the screen so you can see it. In our case, we computed the mean, median, and standard deviation, but only the standard deviation is being returned by the function and hence displayed. So this function is just an inefficient version of \function{sd()}. That isn't really what we wanted. @@ -1538,6 +1570,9 @@ mystats <- function(x) { } mystats((1:20)^2) @ + +\enlargethispage{1in} + Now the only problem is that we have to remember which number is which. We can fix this by giving names to the slots in our vector. While we're at it, let's add a few more favorites to the list. We'll also add an explicit \function{return()}. 
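+The ``value of the last line'' behavior is easy to demonstrate with a toy
+function (our own illustration, separate from the \function{mystats()} example):
+<<>>=
+f <- function(x) {
+  x + 1    # computed, then discarded
+  x * 2    # the value of the last line is returned
+}
+f(10)      # 20, not 11
+@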
\Rindex{return()}% @@ -1550,13 +1585,12 @@ mystats <- function(x) { } mystats((1:20)^2) -summary(Sepal.Length~Species, data=iris, fun=mystats) -aggregate(Sepal.Length~Species, data=iris, FUN=mystats) +aggregate(Sepal.Length ~ Species, data=iris, FUN=mystats) @ \end{widestuff} -Notice how nicely this works with \function{aggregate()} and with the \function{summary()} function from the \pkg{Hmisc} package. You can, of course, define your own favorite function to use with \function{summary()}. +Notice how nicely this works with \function{aggregate()}. \Rindex{favstats()}% The \function{favstats()} function in the \pkg{mosaic} package includes the quartiles, mean, standard deviation, sample size, and number of missing observations. @@ -1571,7 +1605,21 @@ favstats(Sepal.Length ~ Species, data=iris) options(width=60) @ \end{widestuff} +We can get a version of our new function that works with the formula template +like this: +<<>>= +# first create a version that works on vectors +mystats_ <- function(x, na.rm = TRUE) { + result <- c(min(x, na.rm = na.rm), max(x, na.rm = na.rm), mean(x, na.rm = na.rm), + median(x, na.rm = na.rm), sd(x, na.rm = na.rm)) + names(result) <- c("min","max","mean","median","sd") + return(result) +} +# now create a version that knows the formula template +mystats <- aggregatingFunction1(mystats_, output.multiple = TRUE) +mystats(Sepal.Length ~ Species, data = iris) +@ \authNote{rjp to add a section here showing how to start with a code chunk @@ -1695,7 +1743,6 @@ Instructors often have their own data sets to illustrate points of statistical i There are now many technologies that support such sharing. For the sake of simplicity, we will emphasize three that we have found particularly useful both in teaching statistics and in our professional collaborative work. These are: \begin{itemize} -\item Within \RStudio\ server. \item A web site with minimal overhead, such as provided by Dropbox. \item The services of Google Docs.
\item A web-based \RStudio\ server for \R. @@ -1713,12 +1760,13 @@ The \RStudio\ server runs on a Linux machine. Users of \RStudio\ have accounts You may already have a web site. We have in mind a place where you can place files and have them accessed directly from the Internet. For sharing data, it's best if this site is public, that is, -it does not require a login. In this case, \function{read.file()} +it does not require a login for others to access the files you +put there. In this case, \function{read.file()} can read the data into \R\ directly from the URL: <>= Fires <- read.csv("http://www.calvin.edu/~rpruim/data/Fires.csv") head(Fires) -xyplot( Acres/Fires ~ Year, data=Fires, ylab="acres per fire", +xyplot(Acres/Fires ~ Year, data=Fires, ylab="acres per fire", type=c("p","smooth")) @ @@ -1812,110 +1860,9 @@ student. Anything she or he drops into the directory is automatically available to the instructor. The student can also share with specific other students (e.g., members of a project group). -We will illustrate the entire process in the context of the following -example. -\begin{example} -One exercise for students starting out in a statistics course is to -collect data to find out whether the ``close door'' button on an -elevator has any effect. This is an opportunity to introduce simple -ideas of experimental design. But it's also a chance to teach about -the organization of data. - -Have your students, as individuals or small groups, study a particular -elevator, organize their data into a spreadsheet, and hand in their -individual spreadsheet. Then review the spreadsheets in class. You -will likely find that many groups did not understand clearly the -distinction between cases and variables, or coded their data in -ambiguous or inconsistent ways. 
- -Work with the class to establish a consistent scheme for the variables -and their coding, e.g., a variable \VN{ButtonPress} with levels -``Yes'' and ``No'', a variable \VN{Time} with the time in seconds -from a fiducial time (e.g. when the button was pressed or would have -been pressed) with time measured in seconds, and variables \VN{ElevatorLocation} -and \VN{GroupName}. Create a spreadsheet -with these variables and a few cases filled in. Share it with the class. - -Have each of your students add their own data to the class data -set. Although this is a trivial task, having to translate their -individual data into a common format strongly reinforces the -importance of a consistent measurement and coding system for recording -data. - -Once you have a spreadsheet file in GoogleDocs, you will want to open -it in \R. This can be exported as a csv file, then -open it using the csv tools in \R, such as \function{read.csv}. -%But there are easier ways that let you work with the data ``live.'' - -%\paragraph{In the web-server version of \RStudio,} described below, you can -% use a menu item to locate and load your spreadsheet. -% -%\begin{center} -% \includegraphics[width=3in]{images/google-spreadsheet1.png} -%\end{center} - -%\paragraph{If you are using other \R\ interfaces,} you must first use the Google -% facilities for publishing documents. - -%\begin{enumerate} - %\item From within the document, use the ``Share'' dropdown menu and - %choose ``Publish as a Web Page.'' - %\item Press the ``Start Publishing'' button in the ``Publish to the - %web'' dialog box. (See figure \ref{fig:publish-google}.) - %\item In that dialog box, go to ``Get a link to the published - %data.'' Choose the csv format and copy out the link that's - %provided. You can then publish that link on your web site, or via - %course-support software. Only people with the link can see the - %document, so it remains effectively private to outsiders. 
-%\end{enumerate} - - -%\begin{figure} -%\begin{center} - %\includegraphics[width=4.5in]{images/publishing-google1.png} -%\end{center} -%\caption{\label{fig:publish-google}Publishing a Google Spreadsheet so that it can be read - %directly into \R.} -%\end{figure} - -Direct communication with GoogleDocs requires facilities that are not present in the base version of \R, but are available through the \pkg{RCurl} package. In order to make these readily available to students, the \pkg{mosaic} package contains a function that takes the quoted (and cumbersome) string with the Google-published URL and reads the corresponding file into a data frame. \pkg{RCurl} neads to be installed for this to work, and will be loaded if it is not already loaded when \function{fetchGoogle()} is called. - -\medskip - -\begin{widestuff} -<>= -OLD <- options(width=90) -@ -<<"read-from-google2",eval=FALSE,tidy=FALSE>>= -elev <- fetchGoogle( -"https://spreadsheets.google.com/spreadsheet/pub?hl=en&hl=en& -key=0Am13enSalO74dEVzMGJSMU5TbTc2eWlWakppQlpjcGc&single=TRUE&gid=0&output=csv") -@ -\end{widestuff} - -\hfill -<<"read-from-google1",echo=FALSE,results="hide">>= -elev <- fetchGoogle("https://spreadsheets.google.com/spreadsheet/pub?hl=en&hl=en&key=0Am13enSalO74dEVzMGJSMU5TbTc2eWlWakppQlpjcGc&single=TRUE&gid=0&output=csv") -@ - -\hfill\newpage - -\begin{widestuff} -<<"read-from-google3">>= -head(elev) -@ -<>= -options(OLD) -@ -\end{widestuff} - -\TeachingTip[2.5cm]{Another option is to get shorter URLs using -a service like \url{tinyurl.com} or \url{bitly.com}.} -Of course, you'd never want your students to type that URL by hand; -you should provide it in a copy-able form on a web site or within a -course support system. -\end{example} +Data can be read directly from google sheets using the \pkg{googlesheets} package. +This works much like \function{read_excel()} from the \pkg{readxl} package. \section{Additional Notes on R Syntax} @@ -1944,7 +1891,7 @@ text2 <- 'Do you use "scare quotes"?' 
used. One common reason for this is a typing error. This is easily corrected by retyping the name with the correct spelling. <<>>= -histogram( ~ aeg, data=HELPrct ) +histogram( ~ aeg, data=HELPrct) @ Another reason for an object-not-found error is using unquoted @@ -2003,11 +1950,11 @@ Here is a brief summary of the commands introduced in this chapter. \begin{widestuff} <>= -source( "file.R" ) # execute commands in a file +source("file.R") # execute commands in a file x <- 1:10 # create vector with numbers 1 through 10 -M <- matrix( 1:12, nrow=3 ) # create a 3 x 4 matrix -data.frame(number = 1:26, letter=letters[1:26] ) # create a data frame +M <- matrix(1:12, nrow=3) # create a 3 x 4 matrix +data.frame(number = 1:26, letter=letters[1:26]) # create a data frame @ \end{widestuff} @@ -2018,8 +1965,8 @@ length(x) # returns length of vector or list dim(HELPrct) # dimension of a matrix, array, or data frame nrow(HELPrct) # number of rows ncol(HELPrct) # number of columns -names( HELPrct ) # variable names in data frame -row.names( HELPrct ) # row names in a data frame +names(HELPrct) # variable names in data frame +row.names(HELPrct) # row names in a data frame attributes(x) # returns attributes of x @ \end{widestuff} @@ -2046,7 +1993,7 @@ sort(x) # returns elements of x in sorted orde order(x) # x[ order(x) ] is x in sorted order rev(x) # returns elements of x in reverse order diff(x) # returns differences between consecutive elements -paste( "Group", 1:3, sep="" ) # same as c("Group1", "Group2", "Group3") +paste("Group", 1:3, sep="") # same as c("Group1", "Group2", "Group3") @ \end{widestuff} @@ -2056,15 +2003,12 @@ write.table(HELPrct, file="myHELP.txt") # write data to a file write.csv(HELPrct, file="myHELP.csv") # write data to a csv file save(HELPrct, file="myHELP.Rda") # save object(s) in R's native format -modData <- mutate( HELPrct, old = age > 50 ) # add a new variable to data frame -women <- subset( HELPrct, sex=='female' ) # select only specified cases -favs 
<- subset( HELPrct, select=c('age','sex','substance') ) # keep only 3 columns +modData <- HELPrct %>% mutate(old = age > 50) # add a new variable to data frame +women <- HELPrct %>% filter(sex=='female') # select only specified cases +favs <- HELPrct %>% select(age, sex, substance) # keep only 3 columns -trellis.par.set(theme=col.mosaic()) # choose theme for lattcie graphics -show.settings() # inspect lattice theme -@ -<>= -fetchGoogle( ... ) # get data from google URL +trellis.par.set(theme=col.mosaic()) # choose theme for lattice graphics +show.settings() # inspect lattice theme @ \end{widestuff} diff --git a/Starting/RForStudents.Rnw b/Starting/RForStudents.Rnw index 6cf826e..4e26fde 100644 --- a/Starting/RForStudents.Rnw +++ b/Starting/RForStudents.Rnw @@ -1,12 +1,10 @@ <>= opts_chunk$set( fig.path="figures/RForStudents-" ) -set_parent('Master-Starting.Rnw') +set_parent("MOSAIC-StartTeaching.Rnw") set.seed(123) @ - - \chapter[What Students Need to Know about R]{What Students Need to Know About R\\ \& How to Teach It} \label{chap:RForStudents} @@ -14,6 +12,9 @@ set.seed(123) In Chapter~\ref{chap:RStudio}, we give a brief orientation to the \RStudio\ IDE and what happens in each of its tabs and panels. +\Pointer{Be sure to look at \emph{A Student's Guide to R} as well. That little +book contains a brief summary of all the commands needed to perform the statistical +analyses typically seen in the first two statistics courses.}% In Chapter~\ref{chap:Template}, we show how to make use of a common template for graphical summaries, numerical summaries, and modeling. In this chapter we cover some additional things that are important for @@ -41,6 +42,11 @@ This will generally determine which \R\ function to use. This will determine the inputs to the function. \end{enumerate} +When your students preface their questions about \R{} by telling you +what they want \R{} to do and what \R{} needs to know to do that, then +you know they have internalized these two questions.
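+For example: ``I want \R\ to draw a histogram (so I need \function{histogram()}),
+and \R\ needs to know which variable to use and which data frame it lives in.''
+(One possible illustration, using the \dataframe{HELPrct} data used throughout
+these notes:)
+<<>>=
+histogram( ~ age, data=HELPrct)
+@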
+ +\newpage \section{Four Things to Know About \R} @@ -208,10 +214,10 @@ install the mosaic package using \Rindex{install_github()} \Rindex{devtools} <>= -# if you haven't already installed this package +# if you haven't already installed devtools install.packages("devtools") require(devtools) -install_github("mosaic", "rpruim") +install_github("ProjectMOSAIC/mosaic") @ Occasionally you might find a package of interest that is not available via @@ -257,9 +263,15 @@ This will give you the documentation for the object you are interested in. \Rindex{apropos()} If you don't know the exact name of a function, you can give part of the name and \R\ will find all functions that match. Quotation marks are mandatory here. +\Pointer{Notice that \code{tally} appears twice. That is because there are two +\code{tally()} functions, one in the \pkg{mosaic} package and one in the +\pkg{dplyr} package. The \function{find()} function can be used to determine +which package(s) a function belongs to. In this case, the \pkg{mosaic} package +takes care of navigating between the two versions of \code{tally()}. In other cases, +you may need to explicitly specify which package's function you want.} <>= -apropos('tally') # must include quotes. single or double. +apropos('tally') # must include quotes. single or double. @ \subsection{\texttt{??} and \texttt{help.search()}} @@ -333,12 +345,11 @@ observational unit). \end{boxedText} \TeachingTip[-1cm]{To help students keep variables and data frames straight, and to make it easier to remember the names, we have adopted the convention that data -frames in the \pkg{mosaic} package are capitalized and variables (usually) are +frames in the \pkg{mosaicData} package are capitalized and variables (usually) are
This convention has worked well, and you may wish to adopt it for your data sets as well.} -\pkg{Births78} -The \dataframe{Births78} data frame contains three variables measured for each +The \dataframe{Births78} data frame contains four variables measured for each day in 1978. There are several ways we can get some idea about what is in the \dataframe{Births78} data frame. @@ -363,10 +374,26 @@ sample(Births78, 4) # show 4 randomly selected rows summary(Births78) # provide summary info about each variable @ +\enlargethispage{1in} + +<>= +oldwidth <- options("width") +options(width=90) +@ + +\Rindex{inspect()} +<>= +inspect(Births78) # provide summary info about each variable +@ + +<>= +options(oldwidth) +@ + \Rindex{str} \begin{widestuff} <>= -str(Births78) # show the structure of the data frame +str(Births78) # show the structure of any R object @ \end{widestuff} @@ -379,8 +406,8 @@ In interactive mode, you can also try ?Births78 @ to access the documentation for the data set. This is also available in the \tab{Help} tab. -Finally, the \tab{Environment} tab provides a list of data in the workspace. Clicking on -one of the data sets brings up the same data viewer as +Finally, the \tab{Environment} tab provides a list of data in the global environment. +Clicking on one of the data sets brings up the same data viewer as \Rindex{View()} <>= View(Births78) @@ -389,8 +416,8 @@ View(Births78) \authNote{add pointer to fetchData()?} %\subsection{Getting at the Variables} -We can gain access to a single variable in a data frame using the \code{\$} operator or, alternatively, using -the \function{with()} function. +We can gain access to a single variable in a data frame using the \code{\$} operator or, +alternatively, using the \function{with()} function. \Rindex{\$} \Rindex{with()} @@ -457,7 +484,7 @@ data(package="mosaic") @ Typically% -\Note{This depends on the package. Most package authors +\Pointer{This depends on the package. 
Most package authors set up their packages with ``lazy loading'' of data. If they do not, then you need to use \function{data()} explicitly.}% you can use data sets by simply typing their names. But if you have already @@ -473,13 +500,13 @@ data(Births78) \Caution{If two packages include data sets with the same name, you may need to specify which package you want the data from with \code{ -data(Births78, package="mosaic") +data(Births78, package="mosaicData") } }% There is no visible effect of this command, but the \dataframe{Births78} data frame -has now been reloaded from the \pkg{mosaic} package and is ready for use. Anything you +has now been reloaded from the \pkg{mosaicData} package and is ready for use. Anything you may have previously stored in a variable with this same name is replaced by -the version of the data set stored with in the \pkg{mosaic} package. +the version of the data set stored with in the \pkg{mosaicData} package. \subsection{Using Your Own Data} @@ -502,20 +529,27 @@ The \pkg{mosaic} package includes a function called \function{read.file()} that \myindex{Excel} Since most software packages can export to csv format, this has become a -sort of \emph{lingua franca} for moving data between packages. Data in excel, for example, +sort of \emph{lingua franca} for moving data between packages. +Data in excel, for example, can be exported as a csv file for subsequent reading in \R.% \Rindex{resample()} -\Rindex{gdata} -\Rindex{read.xls()} -\Caution{There is a conflict between the \function{resample()} functions in -\pkg{gdata} and \pkg{mosaic}. If you want to use \pkg{mosaic}'s \function{resample()}, -be sure to load \pkg{mosaic} \emph{after} you load \pkg{gdata}.} -If you have python installed on your system, you can also use -\function{read.xls()} from the \pkg{gdata} package to read read directly from -Excel files without this extra step. 
- -Each of these functions accepts a URL as well as a file name, which provides an -easy way to distribute data via the Internet: +\Rindex{readxl} +\Rindex{read_excel} +\Rindex{haven} +%\Caution{There is a conflict between the \function{resample()} functions in +%\pkg{gdata} and \pkg{mosaic}. If you want to use \pkg{mosaic}'s \function{resample()}, +%be sure to load \pkg{mosaic} \emph{after} you load \pkg{gdata}.} +%If you have python installed on your system, you can also use +%\function{read.xls()} from the \pkg{gdata} package to read read directly from +%Excel files without this extra step. +There is a danger in doing this, however, since some types of data don't export +from Excel the way you might expect. A safer way to read Excel files is to use +the \function{read_excel()} function from the \pkg{readxl} package. +The \pkg{haven} package includes utilities for reading data in several other formats +that are exported from other statistics packages like SAS and Stata. + +Some of these data-ingesting functions accept a URL as well as a file name, +which provides an easy way to distribute data via the Internet: \authNote{Should we change URLs to something at mosaic-web.org?} \Rindex{read.table()} \Rindex{head()} @@ -524,7 +558,8 @@ easy way to distribute data via the Internet: \begin{widestuff} <>= -births <- read.file('http://www.calvin.edu/~rpruim/data/births.txt', header=TRUE) +births <- + read.table('http://www.calvin.edu/~rpruim/data/births.txt', header=TRUE) head(births) # live births in the US each day of 1978. @ \end{widestuff} We can omit the \option{header=TRUE} if we use \function{read.file()}. \begin{widestuff} <>= -births <- read.file('http://www.calvin.edu/~rpruim/data/births.txt') +births <- + read.file('http://www.calvin.edu/~rpruim/data/births.txt') @ \end{widestuff} %\Rstudio\ will help you import your own data. To do so use the ``Import Dataset" -%button in the \tab{Workspace} tab.
You can load data from text files, from the web, or from +%button in the \tab{Environment} tab. You can load data from text files, from the web, or from %google spreadsheets. \subsection{Importing Data in \RStudio} @@ -550,11 +586,10 @@ they are there you can include them in posts, etc.}% The \RStudio\ interface provides some GUI tools for loading data. If you are using the \RStudio\ server, you will first need to upload the data to the server (in the \tab{Files} tab), and then import the data into your \R\ -session (in the \tab{Workspace} tab).% +session (in the \tab{Environment} tab).% If you are running the desktop version, the upload step is not needed. - \subsection{Working with Pretabulated Data} \InstructorNote[-1cm]{Even if you use \RStudio\ GUI for interactive work, you will want to know how to use functions like \function{read.csv()} for working @@ -568,7 +603,8 @@ of \function{c()}, \function{rbind()} and \function{cbind()}: \Rindex{c()} \Rindex{cbind()} \Rindex{rbind()} -\TeachingTip[2cm]{This is an important technique if you use a text book that presents pre-tabulated categorical data.} +\TeachingTip[2cm]{This is an important technique if you use a text book that +presents pre-tabulated categorical data.} <>= myrace <- c( NW=67, W=467 ) # c for combine or concatenate @@ -584,9 +620,9 @@ mycrosstable <- rbind( ) mycrosstable @ - +\noindent Replacing \function{rbind()} with \function{cbind()} will allow you to give the data column-wise instead. -\TeachingTip[-1cm]{If plotting pre-tabulated categorical data is important, you probably want to provide your students with a wrapper function to simplify all this. We generally avoid this situation by provided the data in raw format or by presenting an analysing the data in tables without using graphical summaries.} +\TeachingTip[-3cm]{If plotting pre-tabulated categorical data is important, you probably want to provide your students with a wrapper function to simplify all this. 
We generally avoid this situation by providing the data in raw format or by presenting and analysing the data in tables without using graphical summaries.} This arrangement of the data would be sufficient for applying the Chi-squared test, but it is not in a format suitable for plotting with \pkg{lattice}. Our cross table is still missing a bit of information -- the names of the variables being stored. We can add this information if we convert it to a table: @@ -611,12 +647,48 @@ mycrosstable \Rindex{bargraph()} We can use \function{barchart()} instead of \function{bargraph()} to plot data already tabulated in this way, but first we need yet one more transformation. + +\enlargethispage{1in} \Rindex{head()} \Rindex{as.data.frame()} <<>>= head(as.data.frame(mycrosstable)) @ +\begin{problem} +The table below is from a study of nighttime lighting in infancy and +eyesight (later in life). +% latex table generated in R 2.12.1 by xtable 1.5-6 package +% Fri Feb 4 15:46:48 2011 +\begin{center} +\begin{tabular}{rrrr} + \hline + & no myopia & myopia & high myopia \\ + \hline +darkness & 155 & 15 & 2 \\ + nightlight & 153 & 72 & 7 \\ + full light & 34 & 36 & 3 \\ + \hline +\end{tabular} +\end{center} + +\begin{enumerate} +%\item +%Do you think this was an experiment or an observational study? Why? +\item +Recreate the table in \R. %Copy and paste the results into your Word document. +\item +What percent of the subjects slept with a nightlight as infants? + +There are several ways to do this. You could use \R\ as a calculator to do the arithmetic. +You can save some typing if you use the function \function{tally()}. See +\code{?tally} for documentation. +%If you just want row and column totals added to the table, see \verb!mar_table()! +%in the \verb!vcd! package. +\item Create a graphical representation of the data. What does this plot reveal?
+\end{enumerate} +\end{problem} + \newpage <>= @@ -645,8 +717,8 @@ be trained to follow good data organization practices: \item Use each subsequent row for one observational unit. \item Give the resulting data frame a good name. \end{itemize} -Scientists may be disappointed that \R\ data frames don't keep track of additional -information, like the units in which the observations are recorded. +Some scientists may be disappointed that \R\ data frames don't keep track +of additional information, like the units in which the observations are recorded. This sort of information should be recorded, along with a description of the protocols used to collect the data, observations made during the data recording process, etc. @@ -686,6 +758,8 @@ abdData(2) # all data sets in chapter 2 %For information on how to create such packages, consult the \textit{Writing R Extensions} manual %on CRAN. +\enlargethispage{1.5in} + \section{Review of \R\ Commands} @@ -705,7 +779,8 @@ Here is a brief summary of the commands introduced in this chapter. 
\Rindex{read.table()}
\Rindex{read.csv()}
\Rindex{read.file()}
-<>=
+\Rindex{inspect()}
+<>=
require(mosaic) # load the mosaic package
require(mosaicData) # load the mosaic data sets
answer <- 42 # store the number 42 in a variable named answer
@@ -716,12 +791,15 @@ data(iris) # (re)load the iris data set
names(iris) # see the names of the variables in the iris data
head(iris) # first few rows of the iris data set
sample(iris, 3) # 3 randomly selected rows of the iris data set
-summary(iris) # summarize each variables in the iris data set
+inspect(iris) # summarize each variable in the iris data set
+summary(iris) # summarize each variable in the iris data set
str(iris) # show the structure of the iris data set
mydata <- read.table("file.txt") # read data from a text file
mydata <- read.csv("file.csv") # read data from a csv file
mydata <- read.file("file.txt") # read data from a text or csv file
+require(readxl)
+mydata <- read_excel("file.xlsx") # read data from an Excel file
@
\end{widestuff}
diff --git a/Starting/RStudio.Rnw b/Starting/RStudio.Rnw
index 26b4616..70ad818 100644
--- a/Starting/RStudio.Rnw
+++ b/Starting/RStudio.Rnw
@@ -1,8 +1,8 @@
<>=
-opts_chunk$set( fig.path="figures/RStudio-" )
-set_parent('Master-Starting.Rnw')
+set_parent('MOSAIC-StartTeaching.Rnw')
set.seed(123)
+knitr::opts_chunk$set( fig.path="figures/RStudio-" )
@
\chapter{Getting Started with RStudio}
@@ -21,6 +21,10 @@ interface to \R\ that has several advantages over other the default \R\ interfac
\item
\RStudio\ can run in a web browser.
+ \Note{Using \RStudio\ in a browser is like Facebook for statistics.
+ Each time the user returns, the previous session is restored and they
+ can resume work where they left off. Users can log in from any device
+ with internet access.}%
In addition to stand-alone desktop versions, \RStudio\ can be set up as a server application that is accessed via the internet.
Installation is straightforward for anyone with experience administering a Linux system. @@ -39,10 +43,6 @@ interface to \R\ that has several advantages over other the default \R\ interfac With a little advanced set up, instructors can save the history of their classroom \R\ use and students can load those history files into their own environment.% - \Note{Using \RStudio\ in a browser is like Facebook for statistics. - Each time the user returns, the previous session is restored and they - can resume work where they left off. Users can login from any device - with internet access.}% \item \RStudio\ provides support for reproducible research. @@ -55,7 +55,7 @@ interface to \R\ that has several advantages over other the default \R\ interfac over the output format. Depending on the level of the course, students can use either of these for homework and projects. \authNote{NH (via rjp): Add some pointers to more information?} - \marginnote{To use Markdown or \pkg{knitr}/\LaTeX\ requires that + \Note{To use Markdown or \pkg{knitr}/\LaTeX\ requires that the \pkg{knitr} package be installed on your system. See Section~\ref{sec:installingPackages} for instructions on installing packages.} @@ -150,8 +150,9 @@ you will see something like the following. Notice that \Rstudio\ divides its world into four panels. Several of the panels are further subdivided into multiple tabs. Which tabs appear in which panels can be customized by the user. 
-\marginnote{We find it convenient to put the console in the upper left rather than the default location (lower right) so that students can see it better when we project our \R\ -session in class.} +\TeachingTip{We find it convenient to put the console in the upper left rather +than the default location (lower left) so that students can see it better when +we project our \R{} session in class.} \section{Using R as a Calculator in the Console} \R\ can do much more than a simple calculator, and we will introduce @@ -183,19 +184,17 @@ You can save values to named variables for later reuse. <>= product = 15.3 * 23.4 # save result product # display the result -product <- 15.3 * 23.4 # <- can be used instead of = +product <- 15.3 * 23.4 # <- instead of = product @ \TeachingTip[-2cm]{It's best to settle on using one or the other of the right-to-left assignment operators rather than to switch -back and forth. The authors have different preferences: -two of us find the equal sign to be simpler for students and more -intuitive, while the other prefers the arrow operator because it -represents visually what is happening in an assignment, because it -can also be used in a left to right manner, and because it makes -a clear distinction between the assignment operator, the use of \code{=} -to provide values to arguments of functions, and the use of \code{==} to test -for equality.}% +back and forth. Here we will adopt the arrow operator because +it represents visually what is happening in an assignment, +because it can also be used in a left to right manner, +and because it makes a clear distinction between the assignment operator, +the use of \code{=} to provide values to arguments of functions, +and the use of \code{==} to test for equality.}% Once variables are defined, they can be referenced in other operations and functions. @@ -203,11 +202,11 @@ and functions. 
<>=
-0.5 * product # half of the product
-log(product) # (natural) log of the product
-log10(product) # base 10 log of the product
-log2(product) # base 2 log of the product
-log(product, base=2) # another way for base 2 log
+0.5 * product # half of the product
+log(product) # (natural) log of the product
+log10(product) # base 10 log of the product
+log2(product) # base 2 log of the product
+log(product, base=2) # another way for base 2 log
@
\authNote{can we come up with a better (e.g. less mathematical) example?}
@@ -215,7 +214,8 @@ The semi-colon can be used to place multiple commands on one line.
One frequent use of this is to save and print a value all in one go:
<>=
-product <- 15.3 * 23.4; product # store and show result
+# store and show result
+product <- 15.3 * 23.4; product
@
@@ -240,26 +240,41 @@ for students to create homework and reports that include text, \R\ code, \R\ out
To create a new RMarkdown file, select \tab{File}, then \tab{New File}, then \tab{RMarkdown}.
The file will be opened with a short template document that illustrates the mark up language.
-Click on \tab{Compile HTML} to convert this to an HTML file. There is a button the
-provides a brief description of the mark commands supported, and the \RStudio\ web site
-includes more extensive tutorials on using RMarkdown.
+If you click on \tab{From Template} before creating the file, you will be given a list
+of template documents available in packages.
+If the \pkg{mosaic} package is loaded, this list will include templates that
+make sure the \pkg{mosaic} package is loaded and change the default size for
+plots to be somewhat smaller than the generic RStudio default.
+The fancy version demonstrates many of the features of RMarkdown.
+(The \RStudio\ web site includes extensive tutorials on using RMarkdown that
+demonstrate a wider range of features.)
+The plain templates are designed to quickly create new documents starting
+from a nearly blank slate.
+
+The process of running the \R{} code and combining text, \R{} code, output, and graphics
+into a single file is called ``knitting''.
+Click on \tab{Knit} to convert the RMarkdown document into an HTML, PDF,
+or Word file.
+
-\Caution{RMarkdown, and \pkg{knitr}/\LaTeX\ files do not have access to the console environment,
-so the code in them must be self-contained.} %
It is important to remember that unlike \R\ scripts, which are executed in the console and have access to the console environment, RMarkdown and \pkg{knitr}/\LaTeX\
-files do not have access to the console environment This is a good feature because it forces
+files do not have access to the console environment.
+\Caution{RMarkdown, and \pkg{knitr}/\LaTeX\ files do not have access
+to the console environment, so the code in them must be self-contained.}%
+This is a good feature because it forces
the files to be self-contained, which makes them transferable and respects good reproducible research practices. But beginners, especially if they adopt a strategy of trying things out in the console and copying and pasting successful code from the console to their file, will often create files that are incomplete and therefore do not compile correctly.
-One good strategy for getting student to use RMarkdown is to provide them with a template
-that includes the boiler plate you want them to use, loads any \R\ packages that they will
-need, sets any \pkg{knitr} or \R\ settings they way you prefer them, and has placeholders for the
-work you want them to do.
+One good strategy for getting students to use RMarkdown is to provide them with an example
+document that includes the boilerplate you want them to use,
+loads any \R\ packages that they will need,
+sets any \pkg{knitr} or \R\ settings the way you prefer them,
+and has placeholders for the work you want them to do.
\section{The Other Panels and Tabs}
@@ -321,14 +336,15 @@ can stay in the environment essentially indefinitely.
\marginnote{If you haven't been entering these example commands at your console, go back and do it!}
Plots created in the console are displayed in the \tab{Plots} tab. For example,
-<<>>=
-# this will make lattice graphics available to the session
+<>=
+# this will make lattice graphics available
require(mosaic)
xyplot( births ~ dayofyear, data=Births78)
@
+\noindent
will display the number of births in the United States for each day in 1978.
From the \tab{Plots} tab, you can navigate to previous plots and also export plots
-in various formats after interactively resizing them.
+in various formats or copy them to the clipboard after interactively resizing them.
% this fixes bad spacing -- but I don't know why the spacing was bad
diff --git a/Starting/SBI.Rnw b/Starting/SBI.Rnw
new file mode 100644
index 0000000..ba5cbda
--- /dev/null
+++ b/Starting/SBI.Rnw
@@ -0,0 +1,916 @@
+<>=
+set_parent("MOSAIC-StartTeaching.Rnw")
+knitr::opts_chunk$set(fig.path="figures/SBI-")
+knitr::opts_chunk$set(size="small")
+knitr::opts_chunk$set(fig.width = 3, fig.height = 2)
+require(mosaic)
+require(mosaicData)
+set.seed(123)
+options(digits = 3)
+trellis.par.set(theme = col.mosaic())
+require(NHANES)
+require(Lock5withR)
+@
+
+
+\chapter{Simulation-Based Inference}
+
+Resampling approaches have become increasingly important in statistical
+education\cite{Tintle:TAS:2015}\cite{Hesterberg:2015}.
+The \pkg{mosaic} package provides simplified functionality to support teaching inference
+based on randomization tests and bootstrap methods. Our goal is to focus attention
+on the important parts of these techniques (e.g., where randomness enters in and how to
+use the resulting distribution) while hiding some of the technical details
+involved in creating loops and accumulating values.
+
+\section{Starting Early}
+
+One of the advantages of simulation-based inference is that one can start teaching
+inference early in the course.
+Section~\ref{sec:lady-tasting-tea} describes an example
+(based on Fisher's lady tasting tea) that we have often used on the first day of class.
+Textbooks that use a simulation-based approach also begin their discussion of the
+inference process immediately, using other examples.\cite{Lock5:2012}\cite{Tintle:ISI:2015}
+Even when teaching a more traditional course, simulation of the lady tasting tea or
+some other example can be introduced early in the course to help students begin
+to understand the key ideas involved in hypothesis testing
+and estimation.
+
+\section{Hypothesis Tests}
+
+Hypothesis testing can be thought of as a 4-step process:
+\begin{enumerate}
+\item
+  State the null and alternative hypotheses.
+\item
+  Compute a test statistic.
+\item
+  Determine the p-value.
+\item
+  Draw a conclusion.
+\end{enumerate}
+
+In a traditional introductory statistics course, once this general framework has been
+mastered, the main work for students is in applying the correct formula to compute
+the standard test statistics in step 2 and using a table or computer to determine
+the p-value based on the known (usually approximate) theoretical distribution of
+the test statistic under the null hypothesis.
+
+In a simulation-based approach, steps 2 and 3 change. In Step 2, it is no longer
+required that the test statistic be normalized to conform with a known, named distribution.
+Instead, natural test statistics, like the difference between two sample means
+
+\[
+\overline{y}_1 - \overline{y}_2
+\]
+can be used in place of the standard two-sample $t$ test statistic
+\[
+\frac{ \overline{y}_1 - \overline{y}_2 }
+     { \sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \;.
+\]
+
+In Step~3, we use randomization to approximate the sampling distribution of the
+test statistic.
Our lady tasting tea example demonstrates how this can be
+done from first principles as early as the first day of class.%
+\footnote{See Section~\ref{sec:lady-tasting-tea}.}
+This example is a bit unusual, however.
+Because the sampling distribution is so simple, the simulation
+required to create a randomization distribution
+is completely specified without reference to the data:
+It's a binomial distribution with parameters determined by
+the sample size and the null hypothesis, and we can simulate
+it with \code{rflip()}.
+%There is only one distribution for a given
+%proportion, but there are many distributions that can have
+%a specified mean. Similarly, if our null hypothesis is that
+%two proportions are equal, this doesn't specify what they
+%are equal to.
+
+More typically, we will use randomization to create new simulated data
+sets that are like our original data in some ways, but make the null
+hypothesis true. For each simulated data set, we calculate our
+test statistic, just as we did for the original sample.
+Together, this collection of test statistics computed from the
+simulated samples constitutes our randomization distribution.
+
+When creating a randomization distribution,
+we will attempt to satisfy three guiding principles.
+
+\begin{enumerate}
+\item Be consistent with the null hypothesis.
+
+We need to simulate a world in which the null hypothesis is true.
+If we don't do this, we won't be testing our null hypothesis.
+
+\item Use the data in the original sample.
+
+The original data should shed light on some aspects of the distribution
+that are not determined by the null hypothesis. For example,
+a null hypothesis about a mean doesn't tell us about the shape
+of the population distribution, but the data give us some indication.
+
+\item
+Reflect the way the original data were collected.
+\end{enumerate}
+
+\subsection{Permutation tests using shuffle()}
+\myindex{permutation test}%
+\Rindex{shuffle()}%
+\Rindex{sample()}%
+\Rindex{hypothesis test}%
+
+The \pkg{mosaic} package provides \code{shuffle()} as a synonym for \code{sample()}.
+When used without additional arguments, this will permute its first argument.
+<<>>=
+shuffle(1:10)
+shuffle(1:10)
+@
+
+Applying \function{shuffle()} to an explanatory variable allows us to test the null
+hypothesis that the explanatory variable has, in fact, no explanatory power.
+This idea can be used to test
+\begin{itemize}
+\item
+the equivalence of two or more proportions,
+\item
+the equivalence of two or more means,
+\item
+whether a regression parameter is 0.
+\end{itemize}
+
+For example, let's test whether young men and women have the same mean body temperature
+using a data set that contains body temperatures for 50 college students, 25 men and
+25 women.
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+require(Lock5withR)
+inspect(BodyTemp50)
+@
+\end{widestuff}
+
+\begin{enumerate}
+\item State the null and alternative hypotheses.
+
+\begin{itemize}
+\item $H_0$: mean body temperature is the same for males and females.
+\item $H_a$: mean body temperature differs between males and females.
+\end{itemize}
+
+\item Compute a test statistic.
+\Rindex{diffmean()}%
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+favstats( BodyTemp ~ Sex, data = BodyTemp50)
+T <- diffmean( BodyTemp ~ Sex, data = BodyTemp50); T
+@
+\end{widestuff}
+
+\item Use randomization to compute a p-value.
+\myindex{p-value}
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+Temp2.Null <-
+  do(1000) * diffmean( BodyTemp ~ shuffle(Sex), data = BodyTemp50)
+histogram( ~ diffmean, data = Temp2.Null, center = 0, v = 0.176)
+tally( ~ (diffmean >= T), data = Temp2.Null)
+prop( ~ (diffmean >= T), data = Temp2.Null)
+@
+\end{widestuff}
+
+\item Draw a conclusion.
+
+The p-value is large, so these data offer no reason to reject the hypothesis
+that male and female college students have the same mean body temperature.
+\end{enumerate}
+
+\subsection{Computing p-values}
+
+In the preceding example,
+we hardly needed to compute a p-value because the histogram
+clearly showed that the observed test statistic (0.176) would not be unusual even if
+the null hypothesis were true,
+so these data don't offer any reason to reject the null hypothesis that male
+and female college students have the same mean body temperature.
+
+Nevertheless, there are two issues related to p-value calculations that we want to address
+with this example: including the observed test statistic in the null distribution,
+and calculating 2-sided p-values.
+\myindex{p-value!2-sided}%
+
+\Caution[-1cm]{If you are using a text book that covers randomization tests, you will
+need to check whether they include the test statistic computed from the original
+data in the null distribution or not.}
+If the null hypothesis is true, then not only our randomly generated data, but also
+the original data were generated in a world in which the null hypothesis is true.
+So it makes sense to add the original test statistic to the randomization distribution
+before calculating the p-value.
+This has two advantages. First, it ensures that our type I error rate
+is no larger than the nominal rate. Second, it avoids reporting a p-value of 0 since there
+will always be at least one test statistic at least as extreme as the one computed from
+the original data, namely the one computed from the original data.
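+
+This adjustment can also be carried out by hand before any helper functions are
+introduced. The following sketch uses \code{Temp2.Null} and the observed statistic
+\code{T} from the example above; the 1 added to the numerator and denominator
+represents the original sample's test statistic.
+<<>>=
+# count the original test statistic along with the resampled ones
+(1 + sum(Temp2.Null$diffmean >= T)) / (1 + nrow(Temp2.Null))
+@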
+
+\Caution{Although using 999 or 9999 replicates results in p-values that are
+``round numbers'', there is some risk that students will use the 999 vs 1000
+distinction as their primary way to tell whether you are creating a randomization
+distribution or a bootstrap distribution.}%
+\Rindex{prop1()}%
+To simplify this calculation, we may choose to use 999 or 9999 replicates instead of
+1000 or 10,000. The \pkg{mosaic} package also includes the \function{prop1()} function,
+which adds an additional count to both the numerator and denominator for the purpose
+of automating this sort of p-value calculation. This will result in a slightly
+larger (one-sided) p-value.
+
+<<>>=
+prop1( ~ (diffmean >= T), data = Temp2.Null)
+@
+\noindent
+The only challenge for the instructor is to decide if and when to introduce
+this minor change to the p-value calculation.
+
+But we need a two-sided p-value given our alternative hypothesis.
+The preferred way to calculate 2-sided p-values is also the simplest:
+just double the 1-sided p-value.
+<<>>=
+2 * prop1( ~ (diffmean >= T), data = Temp2.Null)
+@
+
+An alternative approach sometimes seen would add the proportion of the randomization
+distribution that is below $-T = \Sexpr{- T}$. For a symmetric randomization distribution,
+this should give a very similar result, but it does not perform as well when the
+randomization distribution is skewed, is slightly more difficult to compute, and is not
+transformation invariant, so tests that are equivalent as 1-sided tests might not result
+in equivalent 2-sided tests.
+It seems there is no reason to introduce this method to students.
+\TeachingTip{But this alternative might be covered in the text book you
+are using, so students might use it even if you don't teach it.}
+
+\subsection{Some additional examples}
+
+The technique of shuffling an explanatory variable can be applied to a wide range of
+situations. The following templates illustrate the similarity among these.
+
+
+
+\Rindex{diffprop()}%
+\Rindex{diffmean()}%
+\Rindex{chisq()}%
+\Rindex{tally()}%
+\Rindex{lm()}%
+
+\medskip
+
+\begin{widestuff}
+<>=
+Two.Proportions <- do(999) * diffprop(y ~ shuffle(x), data = Data)
+Two.Means <- do(999) * diffmean(y ~ shuffle(x), data = Data)
+Linear.model <- do(999) * lm(y ~ shuffle(x) + a, data = Data)
+Two.Way.Table <- do(999) * chisq(y ~ shuffle(x), data = Data)
+@
+\end{widestuff}
+
+\Note[1.1cm]{The \function{chisq()} function computes the chi-squared statistic
+either from a formula and data frame, from a table produced by \function{tally()},
+or from an object produced by \function{chisq.test()}.}
+
+
+As an example, let's consider the proportion of subjects in the Health Evaluation
+and Linkage to Primary Care (HELP) study who were admitted to the substance abuse program
+for each of three substances: alcohol, cocaine, and heroin. We'd like to know if
+there is evidence that these proportions differ for men and for women.
+In our data set, we observe modest differences.
+
+<<>>=
+tally( substance ~ sex, data = HELPrct,
+       format="prop", margins = TRUE)
+@
+\noindent
+Could those differences be attributed to chance? Or do these results provide
+reliable evidence that the drug of choice varies (a bit) between men and women?
+
+We can simulate a world in which the proportions vary only because of random sampling
+variability using \function{shuffle()} to permute the \variable{sex} (or equivalently
+\variable{substance}) labels.
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+T <- chisq(substance ~ sex, data = HELPrct); T # observed test statistic
+Substance.Null <-
+  do(999) * chisq(substance ~ shuffle(sex), data = HELPrct)
+histogram( ~ X.squared, data = Substance.Null, v = T, width = 0.25)
+prop1( ~(X.squared >= T), data = Substance.Null)
+@
+\end{widestuff}
+\noindent
+Both the histogram and our randomization p-value suggest that the differences observed
+between men and women are not statistically significant.
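+
+For comparison, the classical (asymptotic) chi-squared test can be run on the
+same data. This is only a sketch for instructors who want to show both
+approaches side by side; \function{tally()} produces the two-way table that
+\function{chisq.test()} expects.
+<<>>=
+# classical chi-squared test on the tabulated data, for comparison
+chisq.test(tally(substance ~ sex, data = HELPrct))
+@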
+
+
+\subsection{Testing a single mean}
+
+\Note{Somewhat surprisingly, this is the most
+challenging hypothesis test to handle with our system.
+See below for one reason this doesn't bother us too much.}%
+One wrinkle in our system is the test for a single mean.
+Let's illustrate with a test of $H_0: \mu = 98.6$ using our sample of 50 body
+temperatures. Testing a null hypothesis of the form
+\begin{itemize}
+\item
+  $H_0$: $\mu = \mu_0$
+\end{itemize}
+is a bit of a special case.
+Unlike the examples above, there is no
+explanatory variable to shuffle. Unlike a test for a single proportion,
+the null hypothesis does not completely specify the sampling distribution.
+
+\Note{Many books use $\overline{x}$ here instead of $\overline{y}$.}%
+At least there is an obvious candidate for a test statistic: the sample mean,
+$\overline y$.
+
+\Rindex{BodyTemp50}%
+\myindex{test statistic}%
+<<>>=
+mean( ~ BodyTemp, data = BodyTemp50)
+@
+This test statistic
+is easily applied to any data set; we just need a way to generate random data sets
+in which the null hypothesis is true.
+As mentioned above, there is no explanatory variable to shuffle.
+If we shuffle \variable{BodyTemp} (or the entire data set), we will get the same
+mean every time, since the mean does not depend on order.
+
+Instead, this time we sample with replacement.
+The \code{resample()} function does this.
+<<>>=
+resample(1:10) # notice the duplicates
+@
+We can resample individual variables or the entire data frame. (Since there is only one
+variable involved in this analysis, the results would be essentially the same either way.)
+
+<<>>=
+# this doesn't work:
+Temp0.Null <-
+  do(999) * mean( ~ BodyTemp, data = resample(BodyTemp50))
+@
+\noindent
+Unfortunately, \code{Temp0.Null} is not a randomization distribution.
+Inspecting a histogram shows that the distribution is not centered at 98.6,
+so we are not simulating a world in which the null hypothesis is true.
+<<>>=
+histogram( ~mean, data = Temp0.Null)
+@
+\noindent
+Instead, it is centered at the mean of our original sample,
+\Sexpr{round(mean( ~ BodyTemp, data = BodyTemp50), 2)}. This hints
+at a way to create a proper randomization distribution. We can shift the distribution
+by $98.6 - \Sexpr{round(mean( ~ BodyTemp, data = BodyTemp50), 2)} =
+\Sexpr{98.6 - round(mean( ~ BodyTemp, data = BodyTemp50), 2)}$.
+That will result in a distribution that has the same shape as our data but a mean
+of 98.6, as the null hypothesis demands.
+
+\smallskip
+
+\begin{widestuff}
+<<>>=
+Temp1.Null <- do(9999) *
+  mean( ~ BodyTemp + (98.6 - 98.26), data = resample(BodyTemp50))
+histogram( ~ mean, data = Temp1.Null, v = 98.26, center = 98.6)
+@
+
+\end{widestuff}
+As before, we can now estimate a p-value by tallying how often we see a value at least as small as 98.26.
+<>=
+2 * prop1( ~ (mean <= 98.26), data = Temp1.Null)
+@
+\noindent
+\Note{We used more replicates in this example
+to give us a better estimate of this small p-value.}%
+This time the p-value is quite small -- it would seem that 98.6 is not the mean
+body temperature.
+
+Of all the randomization distributions, randomization distributions used to test
+hypotheses about a mean are the most awkward to create because of the shifting that
+is required to center the distribution and the use of \code{resample()} (which can
+cause confusion with bootstrap distributions).
+Fortunately, creating a confidence interval from a bootstrap distribution
+is straightforward, and in this situation we typically prefer confidence intervals
+to p-values.
+
+
+\section{The Bootstrap}
+
+The bootstrap is a method used (primarily) for creating confidence intervals. The
+basic idea is quite simple and helps reinforce important ideas about what a
+confidence interval is.
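+
+For instructors who want to show students exactly what \function{do()} and
+\function{resample()} automate, the entire procedure can be sketched in a few
+lines of base \R\ using the \dataframe{BodyTemp50} data from earlier (a sketch
+only; the \pkg{mosaic} idiom developed below is what we use with students):
+<<>>=
+# resample the data 1000 times, recording the mean of each resample
+Boot.Means <- replicate(1000,
+  mean(sample(BodyTemp50$BodyTemp, replace = TRUE)))
+quantile(Boot.Means, c(0.025, 0.975)) # 95% percentile interval
+@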
+ +\subsection{The idea behind the bootstrap} +\myindex{bootstrap}% + +\Caution{There are more complicated methods for computing bootstrap confidence +intervals that have better performance. We introduce bootstrap confidence +intervals using the two simple methods here. Sometimes we return later in the course +to talk about the bootstrap-t intervals.} +Suppose we want to estimate the mean body temperature using the \dataframe{BodyTemp50} +data set. It is simple enough to compute the mean from our data. +<<>>= +mean( ~ BodyTemp, data = BodyTemp50) +@ +\noindent +What is missing is some sense for how precise this estimate is. +The most common way to present this information is with a confidence interval. + +If we had access to the entire population, we could generate many random samples +to see how much variability there is in estimates from sample to sample +(see Section~\ref{sec:sampling-dists}). +In practice, we will never have access to the entire population (or we wouldn't need +to be making estimates). The key idea of the bootstrap is +to treat our sample as an approximate representation +of the population, and to generate an approximate sampling distribution by sampling +(with replacement) \emph{from our sample}. +\Note{We can use bootstrap methods to estimate the bias in the estimate +as well.}% +The shape of the bootstrap distribution indicates how precise our estimate is. + +Before we proceed, there are a few important things to note about this process. +\begin{enumerate} +\item Resampling does not provide a better estimate. + +Resampling is only used to estimate the sample-to-sample \emph{variability} +in our estimate, not in an attempt to improve the estimate itself. +If we attempted to improve our estimate using our bootstrap samples, +we would just make things worse by producing an estimate of our estimate +and essentially doubling any bias in the estimation. + +\item +Resampling works better with large samples than with small samples. 
+
+Small samples are unlikely to represent the population well. While resampling can
+provide methods that work as well as the traditional methods in standard situations
+and can be applied in a wider range of situations without degraded performance,
+it does not fundamentally alter the need to have a sufficient sample size.
+
+\item
+The two bootstrap methods we present below are chosen for simplicity, not for performance.
+
+The primary value in introducing bootstrapping in introductory courses is pedagogical,
+not scientific. The percentile and standard error intervals introduced below are
+readily accessible to students and can be applied in a wide range of situations. But they
+are not the state of the art.
+In Section~\ref{sec:improved-cis} we will briefly discuss the bootstrap-t interval,
+a more accurate bootstrap method. Other methods, such as BCa (bias corrected and
+accelerated) or ABC (approximate bootstrap confidence) also improve upon the percentile and
+standard error methods, but are beyond the scope of most introductory courses.
+
+\Rindex{resample}%
+\Rindex{boot}%
+Packages like \pkg{resample} and \pkg{boot} provide functions for
+computing intervals using more sophisticated methods.
+\end{enumerate}
+
+\subsection{Bootstrap confidence intervals for a mean}
+
+\myindex{confidence interval}%
+Creating a randomization distribution to test a hypothesis about a single mean
+had some extra challenges.
+Fortunately, a confidence interval is often preferable in this situation,
+and creating a bootstrap distribution for a single mean is straightforward:
+we simply compute the mean body temperature from many resampled versions of our
+original data.
+
+<<>>=
+Temp.Boot <-
+  do(1000) * mean( ~BodyTemp, data = resample(BodyTemp50))
+@
+\noindent
+When applied to a data frame, the \function{resample()} function samples
+rows with replacement to produce a new data frame with the same number of rows
+as the original, but some rows will be duplicated and others missing.
+
+\Caution[-2cm]{In less than ideal situations, we may need to adjust for bias
+or use more sophisticated methods. It is good for students to be in the habit
+of checking these features of the bootstrap distribution before using the
+simple bootstrap methods we present in this section.}
+Ideally, a bootstrap distribution should be unimodal, roughly symmetric, and
+centered at the original estimate.
+<<>>=
+mean( ~ BodyTemp, data = BodyTemp50)
+mean( ~ mean, data = Temp.Boot)
+histogram( ~ mean, data = Temp.Boot, nint = 25,
+  v = mean( ~ BodyTemp, data = BodyTemp50),
+  c = mean( ~ BodyTemp, data = BodyTemp50)
+  )
+@
+
+To compute a 95\% percentile confidence interval, we determine the range of the central
+95\% of the bootstrap distribution. The \function{cdata()} function automates this
+calculation.
+\Rindex{cdata()}%
+<<>>=
+cdata( ~ mean, data = Temp.Boot, p = 0.95)
+@
+\noindent
+\Rindex{qdata()}%
+Alternatively, \function{qdata()} can be used to obtain the left and right endpoints
+separately (or for 1-sided confidence intervals).
+
+<<>>=
+qdata( ~ mean, data = Temp.Boot, p = 0.025)
+qdata( ~ mean, data = Temp.Boot, p = 0.975)
+@
+
+A second simple method for computing a confidence interval from a bootstrap
+distribution involves using the bootstrap distribution to estimate the standard error.
+
+\myindex{standard error}
+<<>>=
+SE <- sd( ~ mean, data = Temp.Boot); SE
+estimate <- mean( ~ BodyTemp, data = BodyTemp50)
+estimate
+estimate + c(-1,1) * 2 * SE
+@
+
+This method does not perform as well as the percentile method,
+but can serve as a good bridge to the formula-based intervals often included even in
+a course that focuses on simulation-based methods.
+How to replace the constant 2 with an appropriate value to create more accurate
+intervals or to allow for different confidence levels is a matter of some subtlety.
+The simplest method is to use quantiles
+of a normal distribution, but this will undercover. Replacing the normal distribution
+with an appropriate t-distribution will widen intervals and can improve coverage, but
+the t-distribution is only correct in a few cases -- such as when estimating the mean
+of a normal population -- and can perform badly when the population is
+skewed.\cite{Hesterberg:2015}
+
+Because each of these methods produces a confidence interval that depends only
+on the distribution of the estimates computed from the resamples, they are easily
+implemented in a wide variety of situations.
+Calculating either of these simple confidence intervals from the bootstrap distribution
+can be further automated using an extension to \code{confint()}.
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+confint(Temp.Boot, method = c("percentile", "stderr"))
+@
+\end{widestuff}
+
+All that remains, then, is the generation of the bootstrap distribution itself.
+
+\subsection{Bootstrap confidence intervals for the difference in means}
+
+If we are interested in a confidence interval for the difference in group means, we can use
+\code{resample()} and \code{do()} to generate a bootstrap distribution in one of two ways. 
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+Temp.Boot2a <-
+  do(1000) * diffmean(age ~ sex, data = resample(HELPrct))
+Temp.Boot2b <-
+  do(1000) * diffmean(age ~ sex, data = resample(HELPrct, groups = sex))
+@
+\end{widestuff}
+\Note{It is useful to adopt a convention regarding the naming of
+randomization and bootstrap distributions. The names should reflect the data
+being used and whether the distribution is a bootstrap distribution or a
+randomization distribution. We typically use \code{.Rand} or \code{.Null}
+to indicate randomization distributions and \code{.Boot} to indicate bootstrap
+distributions.}
+\noindent
+In the second example, the resampling happens within the sex groups so that the marginal
+counts for each sex remain fixed. This can be especially important if one of the groups
+is small, because otherwise some resamples might not include any observations of that
+group.
+
+<>=
+set.seed(123456)
+@
+<<>>=
+favstats(age ~ sex, data = HELPrct)
+D <- diffmean( age ~ sex, data = HELPrct); D
+favstats(age ~ sex, data = resample(HELPrct))
+favstats(age ~ sex, data = resample(HELPrct, groups = sex))
+@
+
+From here, the computation of confidence intervals proceeds as before.
+
+\Note{Visually inspecting the bootstrap distribution for skew and bias is an important
+step to make sure the percentile interval is not being applied in a situation where
+it may perform poorly.}
+<<>>=
+histogram( ~ diffmean, data = Temp.Boot2b, v = D)
+qqmath( ~ diffmean, data = Temp.Boot2b)
+cdata( ~ diffmean, p = 0.95, data = Temp.Boot2b)
+@
+
+Alternatively, we can compute a confidence interval based on a bootstrap
+estimate of the standard error.
+<<>>=
+SE <- sd( ~ diffmean, data = Temp.Boot2b); SE
+D + c(-1,1) * 2 * SE
+@
+% \noindent
+% The primary pedagogical value of the bootstrap standard error approach is its close
+% connection to the standard formula-based confidence interval methods. 
+% How to replace the constant 2 with an appropriate value to create more accurate intervals
+% or to allow for different confidence levels is a matter of some subtlety
+% \cite{Hesterberg:2015}. The simplest method is to use quantiles
+% of a normal distribution, but this will undercover. Replacing the normal distribution
+% with an appropriate t-distribution will widen intervals and can improve coverage, but
+% the t-distribution is only correct in a few cases -- such as when estimating the mean
+% of a normal population -- and can perform badly when the population is skewed.
+% See Section~\ref{sec:improved-cis} for more on this.
+
+
+\noindent
+\Rindex{confint()}%
+Either interval can be computed using \code{confint()}, if we prefer.
+<<>>=
+confint(Temp.Boot2b, method = c("percentile", "stderr"))
+@
+
+\subsection{Comparing bootstrap distributions}
+
+To illustrate the similarity among the commands used to create
+bootstrap distributions, we present five examples
+that might appear in an introductory course.
+
+\medskip
+
+\begin{widestuff}
+<>=
+One.Proportion <- do(1000) * prop( ~ x, data = resample(Data))
+Two.Proportions <- do(1000) * diffprop( y ~ x, data = resample(Data, groups = x))
+One.Mean <- do(1000) * mean( ~ x, data = resample(Data))
+Two.Means <- do(1000) * diffmean( y ~ x, data = resample(Data, groups = x))
+Correlation <- do(1000) * cor( y ~ x, data = resample(Data))
+@
+\end{widestuff}
+
+In the next section we discuss how to extend this to regression models.
+
+\section{Resampling for Regression}
+
+There are at least two ways we can consider creating a bootstrap distribution
+for a linear model.
+We can easily fit a linear model to a resampled data set. But in some situations
+this may have undesirable features. Influential observations, for example, will
+appear duplicated in some resamples and be missing entirely from other resamples.
+
+Another option is to use ``residual resampling". 
In residual resampling, the new data set
+has all of the predictor values from the original data set, and a new response is
+created by adding a resampled residual to the fitted value.
+
+
+Both methods are simple to implement;
+we either resample the data or resample the model itself.
+
+\medskip
+
+\Rindex{relm()}%
+\Rindex{resample()}%
+\begin{widestuff}
+<<>>=
+mod <- lm( length ~ width + sex, data = KidsFeet)   # original model
+do(1) * mod   # see how do() treats it
+do(2) * lm( length ~ width + sex, data = resample(KidsFeet))   # resampled data
+do(2) * lm( length ~ width + sex, data = resample(mod))        # resampled residuals
+do(2) * relm(mod)   # abbreviated residual resampling
+@
+\end{widestuff}
+
+From here it is straightforward to create a confidence interval for the slope
+(or intercept, or any coefficient) in a linear model.
+<<>>=
+Kids.Boot <- do(1000) * relm(mod)
+cdata( ~ width, data = Kids.Boot, p = 0.95)
+confint( Kids.Boot, parm = "width")
+@
+
+
+\section{Which comes first: p-values or intervals?}
+
+This is a matter of some discussion among instructors and textbook authors. The two
+most widely recognized simulation-based introductory statistics books give different answers.
+One\cite{Tintle:ISI:2015} introduces hypothesis testing first,
+the other\cite{Lock5:2012} begins with bootstrap confidence intervals.
+These two books differ in several other ways as well.
+It remains to be seen whether best practices will emerge or whether
+some issues will remain a matter of personal preference. This is not unlike
+the older debate over whether one should begin with quantitative or categorical data
+-- another way in which these two simulation-based books diverge.
+
+\section{Dealing with Monte Carlo Variability}
+
+Because randomization and bootstrap distributions involve a random component, p-values
+and confidence intervals computed from the same data will vary.
+For students (and graders), this can be disconcerting because there is no ``right" answer. 
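+
+A quick demonstration drives this point home. The sketch below (our addition, reusing
+the \dataframe{BodyTemp50} example and the functions introduced earlier in this chapter)
+builds two bootstrap distributions from the same data; the particular interval
+endpoints obtained will differ from run to run.
+<<>>=
+# two bootstrap distributions computed from the same data ...
+Boot1 <- do(1000) * mean( ~ BodyTemp, data = resample(BodyTemp50))
+Boot2 <- do(1000) * mean( ~ BodyTemp, data = resample(BodyTemp50))
+# ... yield slightly different 95% percentile intervals
+cdata( ~ mean, data = Boot1, p = 0.95)
+cdata( ~ mean, data = Boot2, p = 0.95)
+@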
+
+The amount of Monte Carlo variability depends on the number of replicates used to
+create the randomization or bootstrap distribution, and students will need some
+guidance about how many replicates to use. It is important that they not use too
+few, as this will introduce too much random noise into p-value and confidence interval
+calculations. But each replicate costs time, and the marginal gain for each additional
+replicate decreases as the number of replicates increases. There is little reason to
+use millions of replicates (unless the goal is to estimate very small p-values).
+We generally use roughly 1,000 for routine or preliminary work and increase this to
+10,000 when we want to reduce the effects of Monte Carlo variability.
+
+In a laboratory setting, it can be instructive to have students compare their p-values
+or confidence intervals using 1,000 and 10,000 replicates. Alternatively, the instructor
+can generate several p-values or confidence intervals to illustrate the same principle.
+
+
+\section{Better Confidence Intervals}
+\label{sec:improved-cis}%
+
+\myindex{bootstrap-t}
+The percentile and ``t with bootstrap standard error" confidence intervals have been
+improved upon in a number of ways. In a first course, we generally do little more
+than mention this fact, and encourage students to inspect the shape of the bootstrap
+distribution for indications of potential problems with the percentile method.
+
+One improvement that can be explained to students in a course that combines
+simulation-based and formula-based approaches is the bootstrap-t interval. 
+
+Rather than attempting to determine the best degrees of freedom for a Student's
+t-distribution, the bootstrap-t approximates the actual distribution of
+$$
+t = \frac{\hat{\theta} - \theta}{SE}
+$$
+using the bootstrap distribution of
+$$
+t^* = \frac{\hat{\theta}^* - \hat{\theta}}{SE^*} \; ,
+$$
+where $\hat{\theta}^*$ and $SE^*$ are the estimate and estimated standard error
+computed from each bootstrap sample.
+Implementing the bootstrap-t interval requires either an extra level of conceptual
+framework or much more calculation to determine the values of $SE^*$. If a standard error
+formula exists (e.g., $SE = s/\sqrt{n}$), this can be applied to each bootstrap
+sample along with the estimator. An alternative is to iterate the bootstrap procedure
+(resampling from each resample) to estimate $SE^*$. Since standard errors are easier
+to estimate than confidence intervals, fewer resamples are required (per resample)
+at the second level; nevertheless, the additional computational overhead is significant.
+
+The \pkg{mosaic} package does not attempt to provide a general framework for the bootstrap-t
+or other ``second-order accurate" bootstrap methods.
+Packages such as \pkg{resample}\cite{resample} and \pkg{boot}\cite{boot,boot-book}
+are more appropriate for situations where speed and accuracy are of utmost importance.
+But the bootstrap-t confidence interval can be computed using
+\code{confint()}, \code{do()} and \code{favstats()} in the case of estimating a single mean or
+the difference between two means.
+
+In the example below, we analyze a data set from the \pkg{resample} package. The
+\dataframe{Verizon} data set contains repair times for customers of CLEC (competitive)
+and ILEC (incumbent) local exchange carriers. 
+ +\medskip + +\begin{widestuff} +<<>>= +# the resample package has name collisions with mosaic, +# so we only load the data, not the package +data(Verizon, package = "resample") +ILEC <- Verizon %>% filter(Group == "ILEC") +favstats( ~ Time, groups = Group, data = Verizon) + ashplot( ~ Time, groups = Group, data = Verizon, + auto.key = TRUE, width = 20) +@ +\end{widestuff} + +\noindent +The skewed distributions of the repair times and unequal sample sizes highlight differences +between the bootstrap-t and simpler methods. + +<<>>= +BootT1 <- + do(1000) * favstats(~ Time, data = resample(ILEC)) +confint(BootT1, method = "boot") +BootT2 <- + do(1000) * favstats( ~ Time, groups = Group, + data = resample(Verizon, groups = Group)) +confint(BootT2, method = "boot") +@ +\noindent +This can also be accomplished manually, although the computations are a bit involved +for the 2-sample case. Here are the manual computations for the 1-sample case: +<>= +estimate <- mean( ~ Time, data = ILEC) +estimate +SE <- sd( ~ mean, data = BootT1); SE +BootT1a <- + BootT1 %>% + mutate( T = (mean - mean(mean)) / (sd/sqrt(n))) +q <- quantile(~ T, p = c(0.975, 0.025), data = BootT1a) +q +estimate - q * SE +densityplot( ~ T, data = BootT1a) +plotDist("norm", add = TRUE, col="gray50") +@ + +For comparison, here are the intervals produced by \code{t.test()} and the percentile method. + +\medskip + +\begin{widestuff} +<<>>= +confint(t.test( ~ Time, data = ILEC)) +BootT1b <- + do(1000) * mean( ~ Time, data = resample(ILEC)) +confint(BootT1b, method = "perc") +@ +\end{widestuff} + +\begin{widestuff} +<<>>= +confint(t.test(Time ~ Group, data = Verizon)) +BootT2b <- + do(1000) * diffmean(Time ~ Group, data = resample(Verizon, groups = Group)) +confint(BootT2b, method = "perc") +@ +\end{widestuff} + +\noindent +In a situation like this, the intervals produced by \code{t.test()} are narrower, +do the least to compensate for skew, +undercover, and miss more often in one direction than in the other. 
+
+Even if these methods are not presented to students, it is good for instructors to
+be at least somewhat familiar with the issues involved and some of the methods that
+have been developed to handle them. See Hesterberg\cite{Hesterberg:2015} for a more
+thorough discussion of what instructors should know about the bootstrap.
+
+\section{Simulating sampling distributions}
+\label{sec:sampling-dists}%
+\myindex{sampling distribution}%
+We conclude this chapter with one more use of \code{sample()}. If we treat a data
+frame as a population, \code{sample()} can be used to draw random samples of a
+specified size to illustrate the idea of a sampling distribution. We could
+use this to illustrate the sampling distribution of a sample mean, for example.
+
+\Rindex{NHANES}%
+As an example, we will use the \dataframe{NHANES} data. This data set has
+been adjusted to reflect the sampling weights used in the
+US National Health and Nutrition Examination Survey and
+is a reasonably good approximation to a simple random sample of size 10,000
+from the US population. For the purpose of this example, we will treat this
+as the entire population and consider samples drawn from it, focusing (for the moment)
+on the \variable{Age} variable.
+
+<<>>=
+require(NHANES)
+mean( ~ Age, data = NHANES)   # population mean
+@
+
+We will consider samples of size 50 and size 200. This can be used to demonstrate
+the role of sample size in the sampling distribution.
+
+\medskip
+
+\begin{widestuff}
+<<>>=
+mean( ~ Age, data = sample(NHANES, 50))   # mean of one sample
+mean( ~ Age, data = sample(NHANES, 50))   # mean of another sample
+@
+\end{widestuff}
+
+\begin{widestuff}
+<<>>=
+# We use bind_rows() to combine two sampling distributions
+# (with different sample sizes) into a single data frame to
+# make graphical and numerical summaries easier. 
+SamplingDist <- + bind_rows( + do(2000) * c(mean = mean( ~ Age, data = sample(NHANES, 50)), n= 50), + do(2000) * c(mean = mean( ~ Age, data = sample(NHANES, 200)), n= 200) + ) +@ +\end{widestuff} + +\begin{widestuff} +<<>>= +mean( mean ~ n, data = SamplingDist) # mean of sampling distribution +sd( mean ~ n, data = SamplingDist) # SE from sampling distribution +@ +\end{widestuff} + +\begin{widestuff} +<<>>= +sd( ~ Age, data = NHANES) / c("50" = sqrt(50), "200" = sqrt(200)) # SE from formula +histogram( ~ mean | factor(n), data = SamplingDist, + nint = 50, density = TRUE) +@ +\end{widestuff} + +A similar approach can be used to create sampling distributions in other situations. diff --git a/Starting/Starting-Printed-Form.pdf b/Starting/Starting-Printed-Form.pdf new file mode 100644 index 0000000..0304561 Binary files /dev/null and b/Starting/Starting-Printed-Form.pdf differ diff --git a/Starting/StartingAdvice.Rnw b/Starting/StartingAdvice.Rnw index 0efa34f..638f6c7 100644 --- a/Starting/StartingAdvice.Rnw +++ b/Starting/StartingAdvice.Rnw @@ -1,6 +1,6 @@ <>= opts_chunk$set( fig.path="figures/RIntro-" ) -set_parent('Master-Starting.Rnw') +set_parent('MOSAIC-StartTeaching.Rnw') set.seed(123) @ @@ -15,15 +15,15 @@ Learning \R\ is a gradual process, and getting off to a good start goes a long w ensuring success. In this chapter we discuss some strategies and tactics for getting started teaching statistics with \R. -In subsequent chapters we provide more details about the (relatively few) \R\ \marginnote{The \pkg{mosaic} package includes a vignette outlining a possible -minimalist set of \R\ commands for teaching an introductory course.} +minimalist set of \R\ commands for teaching an introductory course.}% +In subsequent chapters we provide more details about the (relatively few) \R{} commands that students need to know and some additional information about \R\ that is useful for instructors to know. 
Along the way we present some of our favorite examples that highlight the use of \R, including some that can be used very early in a course. -\authNote{add a pointer to the 1 page handout somewhere?} +%\authNote{add a pointer to the 1 page handout somewhere?} \section{Strategies} @@ -93,9 +93,9 @@ seeks to make more things simpler and more similar to each other so that student can more easily become independent, creative users of \R. But even if you don't choose to do things exactly the way we do, we recommend using ``Less Volume, More Creativity" as a guideline.} + Use a few methods frequently and students will learn how to use them well, flexibly, even creatively. - Focus on a small number of data types: numerical vectors, character strings, factors, and data frames. Choose functions that employ a similar framework and style to increase the ability of students @@ -162,12 +162,13 @@ Increased focus on concepts rather than calculations Get your students to think that using the computer is just part of how statistics is done, rather than an add-on. \item -\BlankNote{It is +Keep the message as simple as possible and keep the commands accordingly simple. +\BlankNote[-1cm]{It is important not to get too complicated too quickly. Early on, we typically use default settings and focus on the main ideas. Later, we may introduce fancier options as students become comfortable with simpler things (and often demand more).} -Keep the message as simple as possible and keep the commands - accordingly simple. Particularly when doing graphics, beware of distracting + +Particularly when doing graphics, beware of distracting students with the sometimes intricate details of beautifying for publication. If the default behavior is good enough, go with it. 
@@ -210,18 +211,24 @@ See Chapter~\ref{chap:RForInstructors} for some of the common error messages and
 
 \begin{enumerate}
 \item
-Introduce Graphics Early.\marginnote[1cm]{In keeping with this advice, most of the examples in this book fall in the area of exploratory data analysis. The organization is chosen to develop gradually anunderstanding of \R. See the companion volume\textit{A Compendium of Commands to Teach Statistics with R} for a tour of commands used in the primary sorts analyses used in the first two undergraduate statistics courses. This companion volume is organized by types of data analyses and presumes some familiarity with the \R\ language.}
+Introduce Graphics Early.
 Introduce graphics very early, so that students see that they can get impressive output from
 simple commands. Try to break away from their prior expectation that there is a ``steep
 learning curve."
 Accept the defaults -- don't worry about the niceties (good labels, nice breaks on
 histograms, colors) too early. Let them become comfortable with
 the basic graphics commands and then play (make sure it feels like play!) with
 fancying things up.
-
-Keep in mind that just because the graphs are easy to make on the computer doesn't mean your students understand how to read the graphs. Use examples that will help students develop good habits for visualizing data.%
-%Remember:
-%\begin{center}
-%\end{center}
+\marginnote{In keeping with this advice, most of the examples in this book
+fall in the area of exploratory data analysis. The organization is chosen to
+develop gradually an understanding of \R. See the companion volume
+\textit{A Student's Guide to R} for a tour of the commands used
+in the primary sorts of analyses in the first two undergraduate statistics
+courses. 
This companion volume is organized by types of data analyses and
+presumes some familiarity with the \R\ language.}
+
+Keep in mind that just because the graphs are easy to make on the computer
+doesn't mean your students understand how to read the graphs. Use examples that
+will help students develop good habits for visualizing data.
 
 \item
 Introduce Sampling and Randomization Early.
 
diff --git a/Starting/TheTemplate.Rnw b/Starting/TheTemplate.Rnw
index 0438416..26a982e 100644
--- a/Starting/TheTemplate.Rnw
+++ b/Starting/TheTemplate.Rnw
@@ -1,8 +1,6 @@
 <>=
-opts_chunk$set( fig.path="figures/RForStudents-" )
-set_parent('Master-Starting.Rnw')
-require(mosaic)
-require(mosaicData)
+opts_chunk$set(fig.path="figures/RForStudents-")
+set_parent("MOSAIC-StartTeaching.Rnw")
 set.seed(123)
 @
@@ -40,6 +38,15 @@ is nothing left to take away.
 
 \bigskip
 
+\marginnote{Mike McCarthy, head coach of the Green Bay Packers football team,
+uses ``Less Volume, More Creativity" as a mantra for his coaching staff
+as they prepare the game plan each week. As an illustration of the principle
+at work, when asked by a fan how many pass plays the team prepares for a
+given opponent, the coach answered,
+``When I first got into the NFL we had 150 passes in our game plan.
+I've put a sign on all of the coordinators' doors -- Less volume, more creativity.
+We function with more concepts with less volume.
+[Now] We're more around 50 [passes] in a game plan."}
 \noindent
 One key to successfully introducing \R\ is finding a set of commands that is
 \begin{itemize}
@@ -48,6 +55,7 @@ One key to successfully introducing \R\ is finding a set of commands that is
 \item {powerful}. % can do what needs doing
 \end{itemize}
 
+
 This chapter provides an extensive example of this
 ``Less Volume, More Creativity" approach.
 The \pkg{mosaic} package (combined with the \pkg{lattice} package and other
@@ -96,15 +104,17 @@ will also emphasize its importance.}
 The template has a bit more flexibility than we have indicated. 
Sometimes the \code{y} is not needed: <>= -goal ( ~ x, data=mydata ) +goal( ~ x, data=mydata ) @ +\noindent The formula may also include a third part <>= -goal ( y ~ x | z , data=mydata ) +goal( y ~ x | z , data=mydata ) @ +\noindent We can unify all of these into one form: <>= -goal ( formula , data=mydata ) +goal( formula , data=mydata ) @ The template can be applied to create numerical summaries, graphical summaries, or model fits @@ -127,7 +137,7 @@ This is the goal. \section{Graphical summaries of data} -\TeachingTip[-2cm]{We recommend showing some plots on the first day and having student +\TeachingTip[-2cm]{We recommend showing some plots on the first day and having students generate their own graphs before the end of the first week.} % Graphical summaries are an important and eye-catching way to demonstrate the @@ -142,7 +152,8 @@ data, to think about distributions, and to pose statistical questions. \Rindex{gplot2} \Rindex{lattice} \Rindex{ggvis} -\Pointer[-1cm]{We are often asked about the other graphics systems, especially \pkg{ggplot2} graphics. In our experience, \pkg{lattice} makes it easier for beginners to create a wide variety of more or less ``standard'' plots -- including the ability to represent multiple variables at once. \pkg{ggplot2}, on the other hand, makes it easier to generate custom plots or to combine plot components. Each has their place, and we use both systems. But for beginners, we typically emphasize \pkg{lattice}. +\Pointer[-1cm]{We are often asked about the other graphics systems, especially \pkg{ggplot2} graphics. In our experience, \pkg{lattice} makes it easier for beginners to create a wide variety of more or less ``standard'' plots -- including the ability to represent multiple variables at once. \pkg{ggplot2}, on the other hand, makes it easier to generate custom plots or to combine plot components. Each has its place, and we use both systems. +But for beginners, we typically emphasize \pkg{lattice}. 
The new \pkg{ggvis} package, by the same author as \pkg{ggplot2} adds interactivity and speed to the strengths of \pkg{ggplot2}.} % @@ -167,7 +178,7 @@ number of births in the United States for each day in 1978. \Rindex{xyplot()} <>= -xyplot( births ~ date, data=Births78) +xyplot(births ~ date, data=Births78) @ \TeachingTip[-2cm]{This plot can make an interesting discussion starter @@ -211,17 +222,22 @@ Now let's create this plot, which shows boxplots of age for each of three substances abused by participants in the \emph{Health Evaluation and Linkage to Primary Care} randomized clinical trial. -\Pointer{You can find out more about the \dataframe{HELPrct} data set using the help command: \code{?HELPrct}.} +\Pointer{You can find out more about the \dataframe{HELPrct} data set using the help command: +\code{?HELPrct}. This will provide you with the codebook for the data and links to the +original source. + +There are also a number of functions that allow us to inspect the contents of a data frame. +Among our favorites are \code{inspect()}, \code{glimpse()}, and \code{head()}.} \Rindex{HELPrct} <>= -bwplot( age ~ substance, data=HELPrct) +bwplot(age ~ substance, data=HELPrct) @ The data we need are in the \dataframe{HELPrct} data frame, from which we want to display variables \variable{age} and \variable{substance} on the $y$- and $x$-axes. According to our template, the command to create this plot has the form <>= -goal( age ~ substance, data=HELPrct ) +goal(age ~ substance, data=HELPrct) @ The only additional information we need is the name of the function that creates boxplots. That function is \function{bwplot()}. 
So we can create the plot with <>= @@ -231,13 +247,15 @@ The only additional information we need is the name of the function that creates To make the boxplots horizontal instead of vertical, reverse the roles of \variable{age} and \variable{substance}: <<>>= -bwplot( substance ~ age, data=HELPrct ) +bwplot(substance ~ age, data=HELPrct) @ \Pointer{You may be wondering about plots for two categorical variables. A commonly used plot for this is a segmented bar graph. We will treat this as a augmented version of a simple bar graph, which is a graphical summary of one categorical variable. -Another plot that can be used to display two (or more) categorical variables is a mosaic plot. The \pkg{lattice} package does not include mosaic plots, but the \pkg{vcd} package provides a \function{mosaic()} function that creates mosaic plots.} +Another plot that can be used to display two (or more) categorical variables is a mosaic plot. +The \pkg{lattice} package does not include mosaic plots, but the \pkg{vcd} package provides +a \function{mosaic()} function that creates mosaic plots.} \Rindex{vcd} \myindex{mosaic plot} @@ -292,54 +310,75 @@ horizontal bar graphs are produced using \option{horizontal = TRUE}. \Rindex{barchart()} \Pointer{The \function{bargraph()} function is not in the \pkg{lattice} package but in the \pkg{mosaic} package. The \pkg{lattice} function \function{barchart()} creates bar graphs from \emph{summarized} data; \pkg{bargraph()} takes care of creating this summary data and then uses \function{barchart()} to create the plot.} <>= -bargraph( ~ substance, data=HELPrct ) -bargraph( ~ substance, data=HELPrct, horizontal=TRUE ) +bargraph( ~ substance, data=HELPrct) +bargraph( ~ substance, data=HELPrct, horizontal=TRUE) @ + \subsection{A palette of plots} \label{sec:paletteOfPlots} +\Pointer{If you are unfamiliar with some of the plots, like ashplots and frequency +polygons, keep reading. 
We have more to say about them shortly.} + The power of the template is that we can now make many different kinds of plots by mimicking the examples above but replacing the goal. +%(The plots appear in Figure~\ref{fig:one-var-plots}.) \Rindex{densityplot()} \Rindex{freqpolygon()} \Rindex{dotPlot()} \Rindex{qqmath()} +\Rindex{ashplot()} <>= - histogram( ~age, data=HELPrct ) -densityplot( ~age, data=HELPrct ) -freqpolygon( ~age, data=HELPrct ) - dotPlot( ~age, data=HELPrct, width=1 ) - bwplot( ~age, data=HELPrct ) - qqmath( ~age, data=HELPrct ) -@ - -\begin{widestuff} -<>= + histogram( ~ age, data=HELPrct) +freqpolygon( ~ age, data=HELPrct) + dotPlot( ~ age, data=HELPrct, width=1) + ashplot( ~ age, data=HELPrct, width=1) +densityplot( ~ age, data=HELPrct) + qqmath( ~ age, data=HELPrct) + bwplot( ~ age, data=HELPrct) + bwplot( ~ age, data=HELPrct, pch = "|") +@ + +%\begin{figure} +<>= <> @ -\end{widestuff} +%\caption{Some one-variable plots.} +%\label{fig:one-var-plots} +%\end{figure} +\Note[-4cm]{If you prefer the more traditional boxplot display with a line at the median +rather than a dot, you can make that the default behavior with +\code{trellis.par.set(box.dot = list(pch = "|")).}} +\noindent +Some people prefer the more traditional boxplot display with a line at the median +rather than a dot. We can make this the default behavior using +<<>>= +trellis.par.set(box.dot = list(pch = "|")) +@ +\newpage For one categorical variable, we can use a bar graph. \Note{The \pkg{lattice} package does not supply a function for creating pie charts. This is no great loss since it is generally harder to make comparisons using a pie chart.} -<<>>= - bargraph( ~sex, data=HELPrct ) # categorical variable +<>= + bargraph( ~ sex, data=HELPrct) # categorical variable @ \bigskip +Two-variable plots are also very similar. 
\Rindex{plotPoints()} <>= - xyplot( width ~ length, data=KidsFeet ) # 2 quantitative vars -plotPoints( width ~ length, data=KidsFeet ) # mosaic alternative - bwplot( length ~ sex, data=KidsFeet ) # 1 cat; 1 quant - bwplot( sex ~ length, data=KidsFeet ) # reverse roles + xyplot( width ~ length, data=KidsFeet) # 2 quantitative vars +plotPoints( width ~ length, data=KidsFeet) # mosaic alternative + bwplot(length ~ sex, data=KidsFeet) # 1 cat; 1 quant + bwplot( sex ~ length, data=KidsFeet) # reverse roles @ <>= <> @@ -356,17 +395,17 @@ for larger data sets. \Rindex{dotplot()} \Rindex{stripplot()} <>= -stripplot( ~length, data=KidsFeet ) - dotplot( ~length, data=KidsFeet ) +stripplot( ~ length, data=KidsFeet) + dotplot( ~ length, data=KidsFeet) @ \TeachingTip{We generally don't introduce \function{dotplot()} and \function{stripplot()} to students but simply use \function{xyplot()} or \function{plotPoints()}.} These and \function{xyplot()} or \function{plotPoints()} can also be used with one quantitative variable and one categorical variable. <>= - xyplot( sex ~ length, data=KidsFeet ) -plotPoints( sex ~ length, data=KidsFeet ) - stripplot( sex ~ length, data=KidsFeet ) - dotplot( sex ~ length, data=KidsFeet ) + xyplot(sex ~ length, data=KidsFeet) +plotPoints(sex ~ length, data=KidsFeet) + stripplot(sex ~ length, data=KidsFeet) + dotplot(sex ~ length, data=KidsFeet) @ \subsection{Groups and sub-plots} @@ -411,12 +450,12 @@ gives us the numerical summary we desire. 
\Rindex{mean()} <<>>= -histogram( ~ age, data=HELPrct ) - mean( ~ age, data=HELPrct ) +histogram( ~ age, data=HELPrct) + mean( ~ age, data=HELPrct) @ \Pointer[-2cm]{To see the full list of these formula-aware -numerical summary functions, use \code{help(favstats).} } +numerical summary functions, use \code{help(favstats)}.} \Rindex{sd()} \Rindex{var()} @@ -438,7 +477,7 @@ numerical summaries, including In addition, the \function{favstats()} function computes many of our favorite statistics all at once: <<>>= -favstats( ~ age, data=HELPrct ) +favstats( ~ age, data=HELPrct) @ \Rindex{tally()} The \function{tally()} function can be used to count cases. @@ -461,29 +500,29 @@ in three ways. Each of these computes the same value. <<>>= # age dependant on substance -sd( age ~ substance, data=HELPrct ) +sd( age ~ substance, data=HELPrct) # age separately for each substance -sd( ~ age | substance, data=HELPrct ) +sd( ~ age | substance, data=HELPrct) # age grouped by substance -sd( ~ age, groups=substance, data=HELPrct ) +sd( ~ age, groups=substance, data=HELPrct) @ The \function{favstats()} function can compute several numerical summaries for each subset <<>>= -favstats( age ~ substance, data=HELPrct ) +favstats(age ~ substance, data=HELPrct) @ Similarly, we can create two-way tables that display either as counts or proportions. <<>>= -tally( sex ~ substance, data=HELPrct ) -tally( ~ sex + substance, data=HELPrct ) +tally(sex ~ substance, data=HELPrct) +tally( ~ sex + substance, data=HELPrct) @ Marginal totals can be added with \option{margins=TRUE} <<>>= -tally( sex ~ substance, data=HELPrct, margins=TRUE ) -tally( ~ sex + substance, data=HELPrct, margins=TRUE ) +tally(sex ~ substance, data=HELPrct, margins=TRUE) +tally( ~ sex + substance, data=HELPrct, margins=TRUE) @ \section{Linear models} @@ -505,8 +544,8 @@ For example, suppose we want to know how the width of kids' feet depends on the length of the their feet. 
We can make a scatter plot and construct a linear model using the same template: <<>>= -xyplot( width ~ length, data=KidsFeet ) -lm( width ~ length, data=KidsFeet ) +xyplot(width ~ length, data=KidsFeet) +lm(width ~ length, data=KidsFeet) @ We'll have more to say about modeling elsewhere. For now, the important point is that our use of the template for graphing and numerical summaries prepares students @@ -521,22 +560,22 @@ tests for means and proportions. The \pkg{mosaic} package brings these into the template as well. \Pointer{For a more thorough treatment of how to use \R\ for the core topics of a traditional introductory statistics course, -see \emph{A Compendium of Commands to Teach Statistics with R}.}% +see \emph{A Student's Guide to R}.}% +\Pointer[0.5cm]{Chi-squared tests can be performed using \function{chisq.test()}. This function is a little different in that it operates on tabulated data of the sort produced by \function{tally()} rather than on the data itself. So the use of the template happens inside \function{tally()} +rather than in \function{chisq.test()}.} \authNote{We could use better examples here. -- rjp 2014-06-21}% \Rindex{t.test()} <>= -t.test( ~ length, data=KidsFeet ) +t.test( ~ length, data=KidsFeet) @ The output from these functions also includes more than we really need. The \pkg{mosaic} package provides \function{pval()} and \function{confint()} for extracting p-values and confidence intervals: -\Pointer{Chi-squared tests can be performed using \function{chisq.test()}. This function is a little different in that it operates on tabulated data of the sort produced by \function{tally()} rather than on the data itself.
So the use of the template happens inside \function{tally()} % -rather than in \function{chisq.test()}.} \Rindex{pval()} \Rindex{confint()} <<>>= -pval( t.test( ~ length, data=KidsFeet ) ) -confint( t.test( ~ length, data=KidsFeet ) ) +pval(t.test( ~ length, data=KidsFeet)) +confint(t.test( ~ length, data=KidsFeet)) @ \Rindex{binom.test()} \Rindex{prop.test()} @@ -545,16 +584,17 @@ confint( t.test( ~ length, data=KidsFeet ) ) OLD <- options(width=100) @ <<>>= -confint(t.test( length ~ sex, data=KidsFeet )) +confint(t.test(length ~ sex, data=KidsFeet)) @ + <<>>= # using Binomial distribution -confint(binom.test( ~ sex, data=HELPrct )) +confint(binom.test( ~ sex, data=HELPrct)) @ <<>>= # using normal approximation to the binomial distribution -confint(prop.test( ~ sex, data=HELPrct )) -confint(prop.test( sex ~ homeless, data=HELPrct )) +confint(prop.test( ~ sex, data=HELPrct)) +confint(prop.test(sex ~ homeless, data=HELPrct)) @ <>= options(OLD) @@ -592,6 +632,7 @@ students ask or an analysis demands them. \subsection{Example: Number of births per day} +\label{sec:births-lines} We have seen the \dataframe{Births78} data set in Section~\ref{sec:Births78Intro}. The plots below take advantage of additional arguments to improve the plot. @@ -616,7 +657,7 @@ From this we can be quite certain that 1978 began on a Sunday. \myindex{titles (plots)} \Rindex{par.settings} <>= -xyplot( births ~ date, data=Births78, +xyplot(births ~ date, data=Births78, groups=dayofyear %% 7, auto.key=list(columns=4), main="Number of US births each day in 1978", @@ -653,8 +694,8 @@ Here we have used The following plot uses lines instead of points which makes it easier to locate the handful of unusual observations. 
<>= -xyplot( births ~ date, data=Births78, - groups=dayofyear %% 7, type='l', +xyplot(births ~ date, data=Births78, + groups=wday, type='l', main="Number of US births each day in 1978", auto.key=list(columns=4, lines=TRUE, points=FALSE), xlab="day of year", @@ -678,7 +719,8 @@ trellis.par.set(col.whitebg()) show.settings() @ -\Pointer{In the printed version of this book, all three examples appear in black and white and were processed with \verb+theme.mosaic(bw=TRUE)+. In the online version, the first and third examples appear in color.} +\Pointer{In the printed version of this book, all three examples appear in black and white and were processed with \texttt{theme.mosaic(bw=TRUE)}. +In the online version, the first and third examples appear in color.} <>= trellis.par.set(theme.mosaic(bw=TRUE)) @@ -694,8 +736,8 @@ show.settings() Themes can also be assigned to \option{par.settings} if we want them to affect only one plot: <>= -xyplot( births ~ date, data=Births78, - groups=dayofyear %% 7, type='l', +xyplot(births ~ date, data=Births78, + groups=wday, type='l', main="Number of US births each day in 1978", auto.key=list(columns=4, lines=TRUE, points=FALSE), par.settings=theme.mosaic(bw=TRUE), @@ -775,7 +817,7 @@ Numerically, the data are being summarized and represented in exactly the same w \Rindex{ladd()} <>= histogram( ~ duration, data=geyser, n=15, col="lightskyblue") -ladd( panel.freqpolygon(geyser$duration, n=15) ) +ladd(panel.freqpolygon(geyser$duration, n=15)) @ This may give a more accurate visual representation in some situations (since the distribution can ``taper off'' better). More importantly, it makes @@ -789,6 +831,24 @@ freqpolygon( ~ Sepal.Length, data=iris, @ %\medskip + +\subsection{ASH plots: Average Shifted Histograms} +\Rindex{ashplot()} + +Histograms are sensitive to the choice of bin widths and edges (or centers). One way to reduce +this dependency is called an Average Shifted Histogram or ASH plot. 
The height of an ASH plot +is the average height over all histograms of a fixed bin width. If you are familiar with +density plots (discussed in the next section), ASH plots will remind you of them, but they are +far easier to explain to beginners. + +<<>>= +ashplot( ~ Sepal.Length, data=iris, groups=Species, + width = 1.0, main = "width = 1.0") +ashplot( ~ Sepal.Length, data=iris, groups=Species, + width = 0.25, main = "width = 0.25") +@ + + \subsection{Density plots: \texttt{densityplot()}} \Rindex{densityplot()} @@ -806,9 +866,9 @@ Higher values smooth more heavily; lower values, less so. <>= densityplot( ~ Sepal.Length, data=iris, groups=Species, - adjust=3, main="adjust=3") + adjust=2, main="adjust=2") densityplot( ~ Sepal.Length, data=iris, groups=Species, - adjust=1/3, main="adjust=1/3") + adjust=1/2, main="adjust=1/2") @ \subsection{The Density Scale} @@ -834,7 +894,7 @@ The density scale is chosen so that the constant of proportionality is 1, in whi \TeachingTip[-1.5cm]{Create some histograms or frequency polygons with a density scale and see if your students can determine what the scale is. Choosing convenient bin widths (but not 1) and comparing plots with different bin widths and different scale types can help them reach a good conjecture about the density scale.} This is the only scale available for \function{densityplot()} and is the most suitable scale if one is primarily interested in the \emph{shape} of the distribution. The vertical scale is affected very little by the choice of bin widths or \option{adjust} multipliers. It is also the appropriate scale to use when overlaying a density function onto a histogram, something the \pkg{mosaic} package makes easy to do. <>= -histogram( ~ Sepal.Length | Species, data=iris, fit="normal" ) +histogram( ~ Sepal.Length | Species, data=iris, fit="normal") @ The other scales are primarily of use when one wants to be able to read off bin counts or percents from the plot.
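Readers who want to verify the density-scale property for themselves can do so numerically. The sketch below uses base \R's \function{hist()} on a simulated sample (rather than the \pkg{lattice}/\pkg{mosaic} plotting functions used in this chapter) and checks that the bar areas on the density scale sum to 1:

```r
# Checking the density scale property with base R's hist():
# on the density scale, each bar's area is height * bin width,
# and the areas across all bars sum to 1.
set.seed(123)
x <- rnorm(100)                       # any sample will do
h <- hist(x, breaks = 10, plot = FALSE)
areas <- h$density * diff(h$breaks)   # area of each bar
sum(areas)                            # total area is 1
```

The total area is 1 regardless of the number or width of the bins, which is one way to see why the density scale is so insensitive to binning choices.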
@@ -876,7 +936,7 @@ Suppose we want to display the following table (based on data from the \myindex{Current Population Survey|see{\texttt{CPS85}}} \Rindex{tally()} <<>>= -tally( ~sector, data=CPS85 ) +tally( ~ sector, data=CPS85) @ The \pkg{mosaic} function \function{bargraph()} can display these tables as bar graphs, but there isn't enough room for the labels. @@ -986,39 +1046,7 @@ discussion of what you can learn from it. \end{enumerate} \end{problem} -\begin{problem} -The table below is from a study of nighttime lighting in infancy and -eyesight (later in life). -% latex table generated in R 2.12.1 by xtable 1.5-6 package -% Fri Feb 4 15:46:48 2011 -\begin{center} -\begin{tabular}{rrrr} - \hline - & no myopia & myopia & high myopia \\ - \hline -darkness & 155 & 15 & 2 \\ - nightlight & 153 & 72 & 7 \\ - full light & 34 & 36 & 3 \\ - \hline -\end{tabular} -\end{center} -\begin{enumerate} -%\item -%Do you think this was an experiment or an observational study? Why? -\item -Recreate the table in \R. %Copy and paste the results into your Word document. -\item -What percent of the subjects slept with a nightlight as infants? - -There are several ways to do this. You could use \R\ as a calculator to do the arithmetic. -You can save some typing if you use the function \function{tally()}. See -\code{?tally} for documentation. -%If you just want row and column totals added to the table, see \verb!mar_table()! -%in the \verb!vcd! package. -\item Create a graphical representation of the data. What does this plot reveal? -\end{enumerate} -\end{problem} \section{Saving Your Plots} @@ -1035,9 +1063,13 @@ Go to your document (e.g. Microsoft Word) and paste in the image. \item Resize or reposition your image as needed. \end{enumerate} -\Rindex{pdf()} -The \function{pdf()} function can be used to save plots as pdf files. See -the documentation of this function for details and links to functions that +Alternatively, a plot can be exported to a file.
+ +\Rindex{pdf()}% +\R{} also provides functions like \function{pdf()} +and \function{png()} that can be used to save plots in a +variety of formats. +See the documentation of these functions for details and links to functions that can be used to save graphics in other file formats. \section{\texttt{mplot()}} @@ -1092,6 +1124,7 @@ x <- 1:10 \Rindex{histogram()} \Rindex{dotPlot()} \Rindex{freqpolygon()} +\Rindex{ashplot()} \Rindex{densityplot()} \Rindex{qqmath()} \Rindex{bwplot()} @@ -1100,26 +1133,26 @@ x <- 1:10 \Rindex{mplot()} \myindex{quantile-quantile plots|see{\texttt{qqmath}}} <>= -require(mosaic) # load the mosaic package -require(mosaicData) # load the mosaic data sets - -tally( ~ sector, data=CPS85 ) # frequency table -tally( ~ sector + race, data=CPS85 ) # cross tabulation of sector by race -mean( ~ age, data = HELPrct ) # mean age of HELPrct subjects -mean( ~ age | sex, data = HELPrct ) # mean age of male and female HELPrct subjects -mean( age ~ sex, data = HELPrct ) # mean age of male and female HELPrct subjects -median(x); var(x); sd(x); # more numerical summaries -quantile(x); sum(x); cumsum(x) # still more summaries -favstats( ~ Sepal.Length, data=iris ) # compute favorite numerical summaries - -histogram( ~ Sepal.Length | Species, data=iris ) # histograms (with extra features) -dotPlot( ~ Sepal.Length | Species, data=iris ) # dot plots for each species -freqpolygon( ~ Sepal.Length, groups = Species, data=iris ) # overlaid frequency polygons -densityplot( ~ Sepal.Length, groups = Species, data=iris ) # overlaid densityplots -qqmath( ~ age | sex, data=CPS85 ) # quantile-quantile plots -bwplot( Sepal.Length ~ Species, data = iris ) # side-by-side boxplots -xyplot( Sepal.Length ~ Sepal.Width | Species, data=iris ) # scatter plots for each species -bargraph( ~ sector, data=CPS85 ) # bar graph +require(mosaic) # load the mosaic package +require(mosaicData) # load the mosaic data sets + +tally( ~ sector, data=CPS85) # frequency table +tally( ~ sector + 
race, data=CPS85) # cross tabulation of sector by race +mean( ~ age, data = HELPrct) # mean age of HELPrct subjects +mean( ~ age | sex, data = HELPrct) # mean age of male and female subjects +mean(age ~ sex, data = HELPrct) # mean age of male and female subjects +median(x); var(x); sd(x); # more numerical summaries +quantile(x); sum(x); cumsum(x) # still more summaries +favstats( ~ Sepal.Length, data=iris) # compute favorite numerical summaries + +histogram( ~ Sepal.Length | Species, data=iris) # histograms (with extra features) +dotPlot( ~ Sepal.Length | Species, data=iris) # dot plots for each species +freqpolygon( ~ Sepal.Length, groups = Species, data=iris) # overlaid freq. polygons +densityplot( ~ Sepal.Length, groups = Species, data=iris) # overlaid densityplots +qqmath( ~ age | sex, data=CPS85) # quantile-quantile plots +bwplot(Sepal.Length ~ Species, data = iris) # side-by-side boxplots +xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris) # side-by-side scatter plots +bargraph( ~ sector, data=CPS85) # bar graph @ <>= mplot(HELPrct) # interactive plot (RStudio only) diff --git a/Starting/file.xlsx b/Starting/file.xlsx new file mode 100644 index 0000000..c9fdd89 Binary files /dev/null and b/Starting/file.xlsx differ diff --git a/Compendium/CategoricalResponse.Rnw b/StudentGuide/CategoricalResponse.Rnw similarity index 71% rename from Compendium/CategoricalResponse.Rnw rename to StudentGuide/CategoricalResponse.Rnw index 9c323c3..e36b415 100644 --- a/Compendium/CategoricalResponse.Rnw +++ b/StudentGuide/CategoricalResponse.Rnw @@ -12,10 +12,11 @@ so we don't need to specify it here. 
The more verbose usage would be \code{famil \Rindex{glm()}% \Rindex{family option}% \Rindex{exp()}% +\Rindex{msummary()}% <>= -logitmod <- glm(homeless ~ age + female, family=binomial, - data=HELPrct) -summary(logitmod) +logitmod <- glm(homeless ~ age + female, + family = binomial, data = HELPrct) +msummary(logitmod) exp(coef(logitmod)) exp(confint(logitmod)) @ @@ -25,18 +26,19 @@ might be interested in the association of homeless status and age for each of th \Rindex{anova()}% \Rindex{test option}% <>= -mymodsubage <- glm((homeless=="homeless") ~ age + substance, - family=binomial, data=HELPrct) -mymodage <- glm((homeless=="homeless") ~ age, family=binomial, - data=HELPrct) -summary(mymodsubage) +mymodsubage <- glm((homeless == "homeless") ~ age + substance, + family = binomial, data = HELPrct) +mymodage <- glm((homeless == "homeless") ~ age, family = binomial, + data = HELPrct) +msummary(mymodsubage) exp(coef(mymodsubage)) -anova(mymodage, mymodsubage, test="Chisq") +anova(mymodage, mymodsubage, test = "Chisq") @ We observe that the cocaine and heroin groups are significantly less likely to be homeless than alcohol-involved subjects, after controlling for age. (A similar result is seen when considering just homeless status and substance.)
<<>>= -tally(~ homeless | substance, format="percent", margins=TRUE, data=HELPrct) +tally(~ homeless | substance, format = "percent", + margins = TRUE, data = HELPrct) @ diff --git a/StudentGuide/Cover/.gitignore b/StudentGuide/Cover/.gitignore new file mode 100644 index 0000000..7d0fbd0 --- /dev/null +++ b/StudentGuide/Cover/.gitignore @@ -0,0 +1,2 @@ +StudentGuideCover.aux +StudentGuideCover.log diff --git a/Compendium/Cover/CoverTemplate.tex b/StudentGuide/Cover/CoverTemplate.tex similarity index 98% rename from Compendium/Cover/CoverTemplate.tex rename to StudentGuide/Cover/CoverTemplate.tex index 4faf09e..0a371a4 100644 --- a/Compendium/Cover/CoverTemplate.tex +++ b/StudentGuide/Cover/CoverTemplate.tex @@ -49,7 +49,7 @@ \begin{textblock*}{5}(3.5in,.6in) \parbox{130mm}{ -\textsc{\bfseries{A Compendium of Commands for Teaching Statistics Using R}} is one +\textsc{\bfseries{A Student Reference for Statistics Using R}} is one of a series of books designed to help statistics educators master integrating modern computation in their courses. 
We refer to our approach as {\bf computational statistics} because diff --git a/Compendium/Cover/RMarkdown-example-source.pdf b/StudentGuide/Cover/RMarkdown-example-source.pdf similarity index 100% rename from Compendium/Cover/RMarkdown-example-source.pdf rename to StudentGuide/Cover/RMarkdown-example-source.pdf diff --git a/Compendium/Cover/RMarkdown-example.Rmd b/StudentGuide/Cover/RMarkdown-example.Rmd similarity index 100% rename from Compendium/Cover/RMarkdown-example.Rmd rename to StudentGuide/Cover/RMarkdown-example.Rmd diff --git a/Compendium/Cover/RMarkdown-example.log b/StudentGuide/Cover/RMarkdown-example.log similarity index 100% rename from Compendium/Cover/RMarkdown-example.log rename to StudentGuide/Cover/RMarkdown-example.log diff --git a/Compendium/Cover/RMarkdown-example.pdf b/StudentGuide/Cover/RMarkdown-example.pdf similarity index 100% rename from Compendium/Cover/RMarkdown-example.pdf rename to StudentGuide/Cover/RMarkdown-example.pdf diff --git a/Compendium/Cover/RMarkdown-example.synctex.gz b/StudentGuide/Cover/RMarkdown-example.synctex.gz similarity index 100% rename from Compendium/Cover/RMarkdown-example.synctex.gz rename to StudentGuide/Cover/RMarkdown-example.synctex.gz diff --git a/Compendium/Cover/RMarkdown-example.tex b/StudentGuide/Cover/RMarkdown-example.tex similarity index 100% rename from Compendium/Cover/RMarkdown-example.tex rename to StudentGuide/Cover/RMarkdown-example.tex diff --git a/Compendium/Cover/CompendiumCover.pdf b/StudentGuide/Cover/StudentGuideCover.pdf similarity index 57% rename from Compendium/Cover/CompendiumCover.pdf rename to StudentGuide/Cover/StudentGuideCover.pdf index a436914..bae73da 100644 Binary files a/Compendium/Cover/CompendiumCover.pdf and b/StudentGuide/Cover/StudentGuideCover.pdf differ diff --git a/Compendium/Cover/CompendiumCover.tex b/StudentGuide/Cover/StudentGuideCover.tex similarity index 62% rename from Compendium/Cover/CompendiumCover.tex rename to 
StudentGuide/Cover/StudentGuideCover.tex index 693461e..d13b11c 100644 --- a/Compendium/Cover/CompendiumCover.tex +++ b/StudentGuide/Cover/StudentGuideCover.tex @@ -1,16 +1,17 @@ \documentclass{article} -\newcommand{\fullwidth}{18.875in} % including edge bleed on both sides -\newcommand{\fullheight}{8.25in} % including edge bleed on top and bottom +% All measurements in mm +\newcommand{\fullwidth}{479.4mm} % including edge bleed on both sides +\newcommand{\fullheight}{209.55mm} % including edge bleed on top and bottom % Bookmobile Template widths. Set these according to the template -\newcommand{\trim}{0.125in} -\newcommand{\flap}{3in} -\newcommand{\wrap}{0.188in} -\newcommand{\cover}{6in} -\newcommand{\spine}{0.25in} +\newcommand{\trim}{3.175} % but don't put the mm for these lengths +\newcommand{\flap}{76.2} +\newcommand{\wrap}{4.78} +\newcommand{\cover}{152.4} +\newcommand{\spine}{6.35} % Order: trim flap wrap cover spine cover wrap flap trim % These are calculated from the above -\newlength{\Xlogo} +%\newlength{\Xlogo} % Change the Xmm to move the item relative to it's default position \usepackage[paperwidth=\fullwidth,paperheight=\fullheight,margin=0in]{geometry} @@ -24,10 +25,10 @@ \usepackage{color} \usepackage{tgbonum} \usepackage{pdfpages} -\setlength{\TPHorizModule}{1in} -\setlength{\TPVertModule}{1in} +\setlength{\TPHorizModule}{1mm} +\setlength{\TPVertModule}{1mm} \definecolor{AuthorColor}{rgb}{1 1 1} -\definecolor{TitleColor}{rgb}{1 1 1 } +\definecolor{TitleColor}{rgb}{1 1 0 } % should be 1 1 1 \definecolor{RColor}{rgb}{.8 .8 .8} @@ -35,23 +36,22 @@ % for relative positioning of the item -% ISBN 978-0-9839658-9-3 +% ISBN 978-0-9839658-31 \begin{document} -\setlength{\Xlogo}{\trim+\flap+\wrap+7mm} %%%% Cover Main Part -\begin{textblock}{6.188}(3.125,0) +\begin{textblock}{6.188}(100,0) \noindent\begin{tikzpicture} -%\fill[white,opacity=.2] (0in,-8.25in) rectangle (12.81in,0in); +\fill[white,opacity=.2] (0in,-8.25in) rectangle (12.81in,0in); 
\end{tikzpicture} \end{textblock} %%%% Back Cover Contents -\begin{textblock*}{5}(3.5in,.6in) -\parbox{130mm}{ +\begin{textblock}{5}(95,10) +\noindent\parbox{130mm}{ -\textsc{\bfseries{A Compendium of Commands for Teaching Statistics Using R}} is one -of a series of books designed to help statistics educators master -integrating modern computation in their courses. We refer to our +\textsc{\bfseries{A Student's Guide to R}} is one +of a series of books designed to help integrate +modern computation into statistics courses. We refer to our approach as {\bf computational statistics} because the availability of computation is shaping how statistics is done, taught, and understood. Computational statistics is a key component of @@ -92,16 +92,16 @@ \medskip \noindent {\bf Other books by the authors:}\\ -{\em Using R for Data Management, Statistical Analysis and Graphics} (NJH \& KK)\\ +{\em Using R for Data Management, Statistical Analysis and Graphics (2nd edition)} (NJH \& KK)\\ {\em Foundations and Applications of Statistics: An Introduction Using R } (RJP), {\em Gems of Theoretical Computer Science} (US \& RJP), -{\em Understanding Nonlinear Dynamics} (DTK), {\em Statistical Modeling: A Fresh Approach} (DTK), {\em Start R in Calculus} (DTK) +{\em Understanding Nonlinear Dynamics} (DTK), {\em Statistical Modeling: A Fresh Approach} (DTK), {\em Start R in Calculus} (DTK), {\em Data Computing} (DTK) } -\end{textblock*} +\end{textblock} -\begin{textblock}{4}(3.5,.5) +\begin{textblock}{4}(91.5,5) \noindent\begin{tikzpicture} -\fill [white,opacity=.75] (0in,-6.1in) rectangle (5.4in,0in); +\fill [white,opacity=.75] (0in,-6.3in) rectangle (5.4in,0in); \end{tikzpicture} \end{textblock} @@ -117,57 +117,56 @@ %%%% Title -\begin{textblock*}{5.2}(237mm,1.8in) %8.6 +\begin{textblock}{52}(232,45.7) %8.6 \noindent{% \hbox{\parbox{4.5in}{\noindent\raggedleft{\textsc{\bfseries{% -\fontsize{28pt}{60pt}\selectfont\textcolor{TitleColor}{% -\hfill A Compendium of\newline \hspace{0pt} 
\newline\hspace{0pt}{\tiny .}\hfill -Commands to Teach% -\newline \hspace{0pt} \newline\hfill Statistics with }}}}}\bfseries{{\fontsize{150pt}{60pt}\selectfont\textcolor{RColor}{\raisebox{-.75in}{R}}}}}} -\end{textblock*} +\fontsize{42pt}{68pt}\selectfont\textcolor{TitleColor}{% +\hfill A Student's \newline \hspace{0pt} \newline\hspace{0pt}{\tiny .}\hfill +Guide% +\newline \hspace{0pt} \newline\hfill to}}}}}\hspace{5mm}\bfseries{{\fontsize{150pt}{60pt}\selectfont\textcolor{RColor}{\raisebox{-17.0mm}{R}}}}}} +\end{textblock} -\begin{textblock*}{4.5in}(285mm,5in) +\begin{textblock}{114}(285,127) \noindent{\textsc{\bfseries{\fontsize{22pt}{60pt}\selectfont - \textcolor{AuthorColor}{Randall Pruim\\\\Nicholas J. Horton\\\\Daniel T. Kaplan}}}} -\end{textblock*} + \textcolor{AuthorColor}{Nicholas J. Horton\\\\Randall Pruim\\\\Daniel T. Kaplan}}}} +\end{textblock} % Spine New -\begin{textblock*}{\spine}(\trim+\flap+\wrap+\cover+\spine/2 + 7pt,0) % was 9.44 - -\vspace*{.4in} +\begin{textblock}{\spine}(242,12) -\noindent\hspace{-.2in}\rotatebox{270}{\large\bf - \textcolor{yellow}{Start Teaching with - R}\hspace{.3in}Pruim, Horton, Kaplan - \hspace{.3in}{\textcolor{yellow}{\sf Project MOSAIC}}\hspace*{.25in}\raisebox{-1.3mm}{\includegraphics[width=0.21in]{../../CoverImages/mosaic-square.png}}} -\end{textblock*} +\noindent\hspace{-.2in}\rotatebox{270}{\Large\bf + \textcolor{yellow}{A Student's Guide to R} + \hspace{.3in}Horton, Pruim, Kaplan + \hspace{.3in}{\textcolor{yellow}{\sf Project MOSAIC}}\hspace*{.25in}\raisebox{-1.3mm}{\includegraphics[width=0.23in]{../../CoverImages/mosaic-square.png}}} +\end{textblock} %%% Logos -\begin{textblock*}{1.5}(\Xlogo,6.65in) +\begin{textblock}{1.5}(95,168.9) \noindent\includegraphics[width=1.5in]{../../CoverImages/mosaic-logo-small.png} \medskip -\noindent \rule{2pt}{0pt}\includegraphics[width=1.425in]{../../CoverImages/RStudio.png} -\end{textblock*} +\noindent\rule{2pt}{0pt}\includegraphics[width=1.425in]{../../CoverImages/RStudio.png} 
+\end{textblock} %%% ISBN 978-0-9839658-6-2 -\begin{textblock}{1.65}(6.6,6.85) % Same as above -\noindent\includegraphics[width=1.75in]{../../CoverImages/ISBN-8-6.png} +\begin{textblock}{1.65}(180,172) % Same as above +\noindent\includegraphics[width=1.75in]{../../CoverImages/ISBN-9780983965831-StudentGuide.pdf} \end{textblock} -\begin{textblock}{1.65}(6.6,6.85) +\begin{textblock}{1.65}(180,173) \noindent\begin{tikzpicture} \fill [white,opacity=.75] (-1.75in,-1in) rectangle (0in,0in); \end{tikzpicture} \end{textblock} + %% Back Flap -\begin{textblock*}{3}(\trim,.25in) +\begin{textblock}{3}(3.3,10) \noindent\includegraphics[width=2.9in]{backflap.pdf} -\end{textblock*} +\end{textblock} \begin{textblock}{3}(0,0) \noindent\begin{tikzpicture} @@ -177,24 +176,26 @@ % Front Flap -\begin{textblock*}{3}(\trim+\flap+\wrap+\cover+\spine+\cover+\wrap-2mm,.25in) % 15.73 -\includegraphics[width=2.9in]{frontflap.pdf} -\end{textblock*} +\begin{textblock}{3}(401,10) % 15.73 +\noindent\includegraphics[width=2.9in]{frontflap.pdf} +\end{textblock} + -\begin{textblock*}{3}(\trim+\flap+\wrap+\cover+\spine+\cover+\wrap,0) % was 15.9 +\begin{textblock}{3}(401,0) % was 15.9 \noindent\begin{tikzpicture} \fill [white,opacity=.75] (0in,0in) rectangle (3.125in,8.25in); \end{tikzpicture} -\end{textblock*} +\end{textblock} + % Front Photo -\begin{textblock}{6.125}(9.3,0) -\noindent\includegraphics[angle=90,height=8.25in,width=9.881in]{../../CoverImages/FrontMain.jpg} +\begin{textblock}{6.125}(236,0) +\noindent\includegraphics[angle=90,height=8.25in,width=9.881in]{../../Starting/Cover/FrontMain.jpg} \end{textblock} % Back Photo \begin{textblock}{6.125}(-0.2,0) -\noindent\includegraphics[angle=90,height=8.25in,width=9.881in]{../../CoverImages/BackMain.jpg} +\noindent\includegraphics[angle=90,height=8.25in,width=9.881in]{../../Starting/Cover/BackMain.jpg} \end{textblock} diff --git a/Compendium/Cover/backflap.pdf b/StudentGuide/Cover/backflap.pdf similarity index 100% rename from 
Compendium/Cover/backflap.pdf rename to StudentGuide/Cover/backflap.pdf diff --git a/Compendium/Cover/flaps.Rnw b/StudentGuide/Cover/flaps.Rnw similarity index 100% rename from Compendium/Cover/flaps.Rnw rename to StudentGuide/Cover/flaps.Rnw diff --git a/Compendium/Cover/flaps.tex b/StudentGuide/Cover/flaps.tex similarity index 100% rename from Compendium/Cover/flaps.tex rename to StudentGuide/Cover/flaps.tex diff --git a/Compendium/Cover/frontflap.pdf b/StudentGuide/Cover/frontflap.pdf similarity index 100% rename from Compendium/Cover/frontflap.pdf rename to StudentGuide/Cover/frontflap.pdf diff --git a/StudentGuide/Cover/frontice.docx b/StudentGuide/Cover/frontice.docx new file mode 100644 index 0000000..765a19f Binary files /dev/null and b/StudentGuide/Cover/frontice.docx differ diff --git a/StudentGuide/Cover/frontice.pdf b/StudentGuide/Cover/frontice.pdf new file mode 100644 index 0000000..97f410b Binary files /dev/null and b/StudentGuide/Cover/frontice.pdf differ diff --git a/Compendium/DataManagement.Rnw b/StudentGuide/DataManagement.Rnw similarity index 68% rename from Compendium/DataManagement.Rnw rename to StudentGuide/DataManagement.Rnw index bdbcd31..1d9ad4d 100644 --- a/Compendium/DataManagement.Rnw +++ b/StudentGuide/DataManagement.Rnw @@ -1,32 +1,47 @@ \label{sec:manipulatingData}% -\myindex{data management}% +\myindex{data wrangling}% +\myindex{wrangling data}% \myindex{thinking with data}% -\TeachingTip{The \emph{Start Teaching with R} book features an extensive section on data management, including use of the \function{read.file()} function to load data into \R\ and \RStudio.} +\FoodForThought{The \emph{Start Teaching with R} book features an extensive section on data management, including use of the \function{read.file()} function to load data into \R\ and \RStudio.} \vspace*{-1cm} -Data management is a key capacity to allow students (and instructors) to ``compute with data'' or +Data wrangling (also known as management, curation, or 
marshaling) is a key capability that allows students (and instructors) to ``compute with data'' or, as Diane Lambert of Google has stated, ``think with data''. -We tend to keep student data management to a minimum during the early part of an introductory +We tend to keep student data wrangling to a minimum during the early part of an introductory statistics course, then gradually introduce topics as needed. For courses where students -undertake substantive projects, data management is more important. This chapter describes +undertake substantive projects, more focus on data management is needed. This chapter describes some key data management tasks. \myindex{read.file()}% -\TeachingTip{The \pkg{dplyr} and \pkg{tidyr} packages provide an elegant approach to data management and facilitate the ability of students to compute with data. Hadley Wickham, author of the packages, +\FoodForThought{The \pkg{dplyr} and \pkg{tidyr} packages provide an elegant approach to data management and facilitate the ability of students to compute with data. Hadley Wickham, author of the packages, suggests that there are six key idioms (or verbs) implemented within these packages that allow a large set of tasks to be accomplished: filter (keep rows matching criteria), select (pick columns by name), arrange (reorder rows), mutate (add new variables), summarise (reduce variables to values), and -group by (collapse groups).} +group by (collapse groups). +See \url{https://nhorton.people.amherst.edu/precursors} for more details and resources.} + +\section{Inspecting dataframes} +\myindex{inspecting dataframes}% +\myindex{dataframes!inspecting}% +The \function{inspect()} function can be helpful in describing the variables in a dataframe (the name for a dataset in \R). +\Rindex{inspect()}% +<<>>= +inspect(iris) +@ +The \dataframe{iris} dataframe includes one categorical and four quantitative variables.
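The six verbs described in the note above can be illustrated without any packages at all. The sketch below applies rough base \R\ analogues to a small invented dataframe (the dataframe and variable names are made up for illustration, and these are approximations of the \pkg{dplyr} verbs, not the functions themselves):

```r
# Toy dataframe for illustrating the six data wrangling verbs (base R analogues).
toy <- data.frame(x = 1:4, y = c(10, 20, 30, 40), g = c("a", "a", "b", "b"))

subset(toy, x > 1)                        # filter: keep rows matching criteria
toy[, c("x", "g")]                        # select: pick columns by name
toy[order(-toy$x), ]                      # arrange: reorder rows
transform(toy, z = x + y)                 # mutate: add a new variable
aggregate(y ~ g, data = toy, FUN = mean)  # group by + summarise: reduce groups to values
```

Seeing the base \R\ forms side by side also helps explain why the \pkg{dplyr} versions, which always take a dataframe first and return a dataframe, chain together so cleanly.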
+ + \section{Adding new variables to a dataframe} \myindex{dataframe}% -We can add additional variables to an existing dataframe (name for a dataset in \R) using \function{mutate()}. But first we create a smaller version of the \dataframe{iris} dataframe. +We can add additional variables to an existing dataframe using \function{mutate()}. But first we create a smaller version of the \dataframe{iris} dataframe. +\Rindex{select()}% \myindex{iris dataset}% <>= irisSmall <- select(iris, Species, Sepal.Length) @@ -38,10 +53,22 @@ irisSmall <- select(iris, Species, Sepal.Length) <>= # cut places data into bins irisSmall <- mutate(irisSmall, - Length = cut(Sepal.Length, breaks=4:8)) + Length = cut(Sepal.Length, breaks = 4:8)) +@ + +\myindex{pipe operator}% +\Rindex{\%>\%}% +Multiple commands can be chained together using the {\tt \%>\%} (pipe) operator: +<<>>= +irisSmall <- iris %>% + select(Species, Sepal.Length) %>% + mutate(Length = cut(Sepal.Length, breaks = 4:8)) @ +Note that when the pipe is used, the dataframe is passed implicitly as the first argument to \function{select()}, +so only the variables to be kept are listed. + -\TeachingTip[1cm]{The \function{cut()} function has an option \option{labels} which can be used to specify more descriptive names for the groups.} +\FoodForThought[1cm]{The \function{cut()} function has an option \option{labels} which can be used to specify more descriptive names for the groups.} <<"mr-adding-variable2-again">>= head(irisSmall) @ @@ -59,12 +86,12 @@ using \function{mutate()}. \Rindex{mutate()}% <<>>= CPS85 <- mutate(CPS85, workforce.years = age - 6 - educ) -favstats(~ workforce.years, data=CPS85) +favstats(~ workforce.years, data = CPS85) @ In fact, this is how the \variable{exper} variable that is already in the \dataframe{CPS85} data was created for all but one of the cases.
<<>>= -tally(~ (exper - workforce.years), data=CPS85) +tally(~ (exper - workforce.years), data = CPS85) @ \section{Dropping variables} @@ -94,7 +121,7 @@ The column (variable) names for a dataframe can be changed using the \function{r \pkg{dplyr} package. <<>>= names(CPS85) -CPSnew = rename(CPS85, workforce=workforce.years) +CPSnew <- rename(CPS85, workforce = workforce.years) names(CPSnew) @ @@ -105,23 +132,23 @@ simple assignment using \function{row.names()}. \myindex{faithful dataset}% The \dataframe{faithful} data set (in the \pkg{datasets} package, which is always available) has very unfortunate names. -\TeachingTip{It's a good idea to start teaching good practices for choice of variable names from day one.} +\FoodForThought{It's a good idea to establish practices for choice of variable names from day one.} <<>>= names(faithful) @ -The measurements are the duration of an euption and the time until the subsequent eruption, +The measurements are the duration of an eruption and the time until the subsequent eruption, so let's give it some better names. <>= faithful <- rename(faithful, - duration = eruptions, - time.til.next=waiting) + duration = eruptions, # new = old + time.til.next = waiting) names(faithful) @ \myindex{faithful dataset}% \begin{center} <<"mr-faithful-xy">>= -xyplot(time.til.next ~ duration, alpha=0.5, data=faithful) +gf_point(time.til.next ~ duration, alpha = 0.5, data = faithful) @ \end{center} If the variable containing a dataframe is modified or used to store a different object, @@ -140,7 +167,8 @@ since the previous eruption. \section{Creating subsets of observations} \myindex{creating subsets}% -\myindex{subsets of dataframes}% +\myindex{subsetting dataframes}% +\myindex{dataframes!subsetting}% \label{sec:subsets} We can also use \function{filter()} to reduce the size of a dataframe by selecting only certain rows. 
@@ -150,13 +178,14 @@ data(faithful) names(faithful) <- c('duration', 'time.til.next') # any logical can be used to create subsets faithfulLong <- filter(faithful, duration > 3) -xyplot( time.til.next ~ duration, data=faithfulLong ) +gf_point(time.til.next ~ duration, data = faithfulLong) @ \end{center} \section{Sorting dataframes} \myindex{sorting dataframes}% +\myindex{dataframes!sorting}% \Rindex{arrange()}% Data frames can be sorted using the \function{arrange()} function. @@ -172,17 +201,17 @@ head(sorted, 3) \section{Merging datasets} \myindex{merging dataframes}% +\myindex{dataframes!merging}% <>= -OLD <- options(width=90) +OLD <- options(width = 90) @ The \dataframe{fusion1} dataframe in the \pkg{fastR} package contains -genotype information for a SNP (single nucleotide polymorphism) in the gene \emph{TCF7L2}. The \dataframe{pheno} dataframe contains phenotypes (includingtype 2 diabetes case/control status) for an intersecting set of individuals. We can join (or merge) these together to explore the association between genotypes and phenotypes using \verb!merge()!. +genotype information for a SNP (single nucleotide polymorphism) in the gene \emph{TCF7L2}. The \dataframe{pheno} dataframe contains phenotypes (including type 2 diabetes case/control status) for an intersecting set of individuals. We can join (or merge) these together to explore the association between genotypes and phenotypes using \verb!merge()!. \Rindex{arrange()}% <>= -require(fastR) -require(dplyr) +library(fastR) fusion1 <- arrange(fusion1, id) head(fusion1, 3) head(pheno, 3) @@ -192,8 +221,8 @@ head(pheno, 3) \Rindex{all.x option}% \Rindex{by.x option}% <>= -require(tidyr) -fusion1m <- inner_join(fusion1, pheno, by='id') +library(tidyr) +fusion1m <- inner_join(fusion1, pheno, by = 'id') head(fusion1m, 3) @ \Rindex{tidyr package}% @@ -201,7 +230,7 @@ head(fusion1m, 3) \myindex{fusion1 dataset}% Now we are ready to begin our analysis. 
<<"mr-fusion1-xtabs">>= -tally(~t2d + genotype, data=fusion1m) +tally(~ t2d + genotype, data = fusion1m) @ \begin{problem} @@ -223,12 +252,13 @@ have in your final dataframe. \section{Slicing and dicing} \myindex{reshaping dataframes}% +\myindex{dataframes!reshaping}% \myindex{transforming dataframes}% \myindex{transposing dataframes}% The \pkg{tidyr} package provides a flexible way to change the arrangement of data. It was designed for converting between long and wide versions of time series data and its arguments are named with that in mind. -\TeachingTip{The vignettes that accompany the \pkg{tidyr} and \pkg{dplyr} packages feature a number of useful examples of common data manipulations.} +\FoodForThought{The vignettes that accompany the \pkg{tidyr} and \pkg{dplyr} packages feature a number of useful examples of common data manipulations.} A common situation is when we want to convert from a wide form to a @@ -254,7 +284,7 @@ form a row in the dataframe. <>= stateTraffic <- longTraffic %>% select(year, deathRate, state) %>% - mutate(year=paste("deathRate.", year, sep="")) %>% + mutate(year = paste("deathRate.", year, sep = "")) %>% spread(year, deathRate) stateTraffic @ @@ -279,20 +309,31 @@ with cuts at 20 and 40 for the CESD scale (which ranges from 0 to 60 points). \Rindex{include.lowest option}% \Rindex{breaks option}% <>= -favstats(~ cesd, data=HELPrct) +favstats(~ cesd, data = HELPrct) HELPrct <- mutate(HELPrct, cesdcut = cut(cesd, - breaks=c(0, 20, 40, 60), include.lowest=TRUE)) -bwplot(cesd ~ cesdcut, data=HELPrct) + breaks = c(0, 20, 40, 60), include.lowest = TRUE)) +gf_boxplot(cesd ~ cesdcut, data = HELPrct) @ \Rindex{ntiles()}% -\TeachingTip{The \function{ntiles} function can be used to automate creation of groups in this manner.} +\FoodForThought{The \function{ntiles} function can be used to automate creation of groups in this manner.} It might be preferable to give better labels. 
<>=
HELPrct <- mutate(HELPrct, cesdcut = cut(cesd,
-  labels=c("low", "medium", "high"),
-  breaks=c(0, 20, 40, 60), include.lowest=TRUE))
-bwplot(cesd ~ cesdcut, data=HELPrct)
+  labels = c("low", "medium", "high"),
+  breaks = c(0, 20, 40, 60), include.lowest = TRUE))
+gf_boxplot(cesd ~ cesdcut, data = HELPrct)
+@
+
+The \function{case\_when()} function is even more general and can also be used for this purpose.
+
+\Rindex{case\_when()}%
+<<>>=
+HELPrct <- mutate(HELPrct,
+  anothercut = case_when(
+    cesd >= 0 & cesd <= 20 ~ "low",
+    cesd > 20 & cesd <= 40 ~ "medium",
+    cesd > 40 ~ "high"))
@
@@ -306,11 +347,11 @@ bwplot(cesd ~ cesdcut, data=HELPrct)
By default R uses the first level in lexicographic order as the reference group for modeling.
This can be overridden using the \function{relevel()} function (see also \function{reorder()}).
<>=
-tally(~ substance, data=HELPrct)
-coef(lm(cesd ~ substance, data=HELPrct))
+tally(~ substance, data = HELPrct)
+coef(lm(cesd ~ substance, data = HELPrct))
HELPrct <- mutate(HELPrct, subnew = relevel(substance,
-  ref="heroin"))
-coef(lm(cesd ~ subnew, data=HELPrct))
+  ref = "heroin"))
+coef(lm(cesd ~ subnew, data = HELPrct))
@

\section{Group-wise statistics}
@@ -328,14 +369,18 @@ the median age of subjects by substance group.
\Rindex{group\_by()}% \Rindex{left\_join()}% \Rindex{summarise()}% +\Rindex{nrow()}% <>= -favstats(age ~ substance, data=HELPrct) +favstats(age ~ substance, data = HELPrct) ageGroup <- HELPrct %>% group_by(substance) %>% summarise(agebygroup = mean(age)) ageGroup -HELPmerged <- left_join(ageGroup, HELPrct, by="substance") -favstats(agebygroup ~ substance, data=HELPmerged) +nrow(ageGroup) +nrow(HELPrct) +HELPmerged <- left_join(ageGroup, HELPrct, by = "substance") +favstats(agebygroup ~ substance, data = HELPmerged) +nrow(HELPmerged) @ @@ -365,12 +410,17 @@ Of the 470 subjects in the 6 variable dataframe, only the \code{drugrisk}, \code \Rindex{na.omit()}% \Rindex{favstats()}% \Rindex{is.na()}% +\Rindex{sum()}% +\Rindex{nrow()}% +\Rindex{ncol()}% <>= -favstats(~ mcs, data=smaller) +favstats(~ mcs, data = smaller) with(smaller, sum(is.na(mcs))) nomiss <- na.omit(smaller) dim(nomiss) -favstats(~ mcs, data=nomiss) +nrow(nomiss) +ncol(nomiss) +favstats(~ mcs, data = nomiss) @ Alternatively, we could generate the same dataset using logical conditions. @@ -382,4 +432,4 @@ dim(nomiss) <>= options(OLD) -@ \ No newline at end of file +@ diff --git a/StudentGuide/GettingStarted.Rnw b/StudentGuide/GettingStarted.Rnw new file mode 100644 index 0000000..a0838bb --- /dev/null +++ b/StudentGuide/GettingStarted.Rnw @@ -0,0 +1,308 @@ + +\label{chap:RStudio} + +\RStudio\ is an integrated development environment (IDE) for \R\ that provides an alternative +interface to \R\ that has several advantages over other the default \R\ interfaces: +\FoodForThought{A series of getting started videos are available at \url{https://nhorton.people.amherst.edu/rstudio}.} +\begin{itemize} + \item \RStudio\ runs on Mac, PC, and Linux machines and provides + a simplified interface that + \emph{looks and feels identical on all of them.} + + The default interfaces for \R\ are quite different on the various platforms. 
This
+ is a distraction for students and adds an extra layer of support responsibility
+ for the instructor.
+ \item
+ \RStudio\ can run in a web browser.
+
+ In addition to the stand-alone desktop versions and \url{RStudio.cloud}, \RStudio\
+ can be set up as a server application that is accessed via the internet.
+
+ The web interface is nearly identical to the desktop version.%
+ \Caution{The desktop and server versions of \RStudio\ are so similar
+ that if you run them both, you will have to pay careful attention to make
+ sure you are working in the one you intend to be working in.}
+ As with other web services, users log in to access their account.
+ If students log out and log in again later, even on a different machine,
+ their session is restored and they can resume their analysis
+ right where they left off.
+ With a little advance setup, instructors can save the history of their
+ classroom \R\ use and students can load those history files into their own
+ environment.%
+ \Note{Using \RStudio\ in a browser is like Facebook for statistics.
+ Each time the user returns, the previous session is restored and they
+ can resume work where they left off. Users can log in from any device
+ with internet access.}%
+ \item
+ \RStudio\ provides support for reproducible research.
+
+ \RStudio\ makes it easy to include text, statistical analysis (\R\ code
+ and \R\ output), and graphical displays all in the same document.
+ The RMarkdown system provides a simple markup language and renders the
+ results in HTML. The \pkg{knitr}/\LaTeX\ system
+ allows users to combine \R\ and \LaTeX\ in the same document. The
+ reward for learning this more complicated system is much finer control
+ over the output format. Depending on the level of the course,
+ students can use either of these for homework and projects.
+
+ \authNote{NH (via rjp): Add some pointers to more information?}
+ \marginnote{Using Markdown or \pkg{knitr}/\LaTeX\ requires that
+ the \pkg{knitr} package be installed on your system. }
+
+
+ \item
+ \RStudio\ provides integrated support for editing and executing \R\
+ code and documents.
+
+ \item
+ \RStudio\ provides some useful functionality via a graphical user interface.
+
+ \RStudio\ is not a GUI for \R, but it does provide a GUI that simplifies things
+ like installing and updating packages; monitoring, saving and loading environments;
+ importing and exporting data; browsing and exporting graphics; and browsing files and
+ documentation.
+
+
+ \item
+ \RStudio\ provides access to the \pkg{manipulate} package.
+
+ The \pkg{manipulate} package provides a way to create simple interactive
+ graphical applications quickly and easily.
+
+\end{itemize}
+While one can certainly use \R\ without using \RStudio, \RStudio\ makes a number
+of things easier and we highly recommend using \RStudio. Furthermore, since \RStudio\
+is in active development, we fully expect more useful features in the future.
+
+
+We primarily use an online version of \Rstudio. \Rstudio\ is an innovative and
+powerful interface to \R\ that runs in a web browser or on your local machine.
+Running in the browser has the advantage that you don't have to install or
+configure anything. Just log in and you are good to go. Furthermore, \Rstudio\
+will ``remember'' what you were doing so that each time you log in (even on a
+different machine) you can pick up right where you left off. This is ``\R\ in
+the cloud'' and works a bit like Google Docs or Facebook for \R.
+
+\R\ can also be obtained from \url{http://cran.r-project.org/}.
+Download and installation are pretty straightforward for Mac, PC, or Linux machines.
+\RStudio\ is available from \url{http://www.rstudio.org/}.
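As a concrete illustration of the \pkg{manipulate} package mentioned in the list above, a minimal interactive plot might look like the following sketch. It must be run inside \RStudio, and the control name \texttt{size} is arbitrary:

```r
library(manipulate)   # interactive controls; works only inside RStudio

# Interactive scatterplot of the faithful data:
# dragging the slider changes the plotting symbol size.
manipulate(
  plot(waiting ~ eruptions, data = faithful, pch = 16, cex = size),
  size = slider(0.5, 3, initial = 1)
)
```

Other controls such as \texttt{picker()} and \texttt{checkbox()} can be combined in the same call to build small interactive demonstrations.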
+ + + +\section{Connecting to an RStudio server} + +\RStudio\ servers have been set up at a number of schools to facilitate cloud-based computing. +\FoodForThought{\RStudio\ servers have been installed at many institutions. +More details about (free) academic licenses for \RStudio\ Server Pro as well as setup instructions can be found at \url{http://www.rstudio.com/resources/faqs} under the {\tt Academic} tab. +} + +Once you connect to the server, you should see a login screen: +\FoodForThought{The \RStudio\ server doesn't tend to work well with Internet Explorer.} + +\includegraphics[width=4.34in]{rstudio-login.png} + +Once you authenticate, +you should see the \RStudio\ interface: + +\includegraphics[width=4.34in]{r-interface.jpg} + +Notice that \Rstudio\ divides its world into four panels. Several of the panels +are further subdivided into multiple tabs. Which tabs appear in which panels +can be customized by the user. + +\R\ can do much more than a simple calculator, and we will introduce +additional features in due time. But performing simple calculations in \R\ is a +good way to begin learning the features of \RStudio. + +Commands entered in the \tab{Console} tab are immediately executed by \R. +A good way to familiarize yourself with the console is to do some simple +calculator-like computations. Most of this will work just like you would +expect from a typical calculator. +Try typing the following commands in the console panel. + +<>= +5 + 3 +15.3 * 23.4 +sqrt(16) # square root +@ + + +This last example demonstrates how functions are called within \R\ as +well as the use of comments. +Comments are prefaced with the \verb!#! character. +Comments can be very helpful when writing scripts +with multiple commands or to annotate example code for your students. + +You can save values to named variables for later reuse. 
+
+\FoodForThought{It's probably best to settle on using
+one or the other of the right-to-left assignment operators rather than to switch
+back and forth. We prefer the
+arrow operator because it
+represents visually what is happening in an assignment
+and because it makes
+a clear distinction between the assignment operator, the use of \code{=}
+to provide values to arguments of functions, and the use of \code{==} to test
+for equality.}%
+
+<>=
+product = 15.3 * 23.4 # save result
+product # display the result
+product <- 15.3 * 23.4 # <- can be used instead of =
+product
+@
+
+
+Once variables are defined, they can be referenced in other operations
+and functions.
+
+<>=
+0.5 * product # half of the product
+log(product) # (natural) log of the product
+log10(product) # base 10 log of the product
+log2(product) # base 2 log of the product
+log(product, base = 2) # base 2 log of the product, another way
+@
+
+The semicolon can be used to place multiple commands on one line.
+One frequent use of this is to save and print a value all in one go:
+
+<>=
+product <- 15.3 * 23.4; product # save result and show it
+@
+
+
+\subsection{Version information}
+
+\Rindex{sessionInfo()}%
+\Rindex{RStudio.Version()}%
+At times it may be useful to check what version of the \pkg{mosaic} package, \R, and
+\RStudio\ you are using. Running \function{sessionInfo()} will display information about the version of \R\ and the packages that are loaded, and \function{RStudio.Version()} will provide information about the version of \RStudio.
+
+<<>>=
+sessionInfo()
+@
+
+\section{Working with Files}
+
+\subsection{Working with \R\ Script Files}
+As an alternative, \R\ commands can be stored in a file. \RStudio\ provides
+an integrated editor for editing these files and facilitates executing some or all of
+the commands. To create a file, select \tab{File}, then \tab{New File}, then \tab{R Script}
+from the \RStudio\ menu. A file editor tab will open in the \tab{Source} panel.
+
+\R\ code can be entered here, and
+buttons and menu items are provided to run all the code (called sourcing the file) or
+to run the code on a single line or in a selected section of the file.
+
+\subsection{Working with RMarkdown and knitr/\LaTeX}
+A third alternative is to take advantage of \RStudio's support for reproducible research.
+If you already know \LaTeX, you will want to investigate the \pkg{knitr}/\LaTeX\ capabilities.
+For those who do not already know \LaTeX, the simpler RMarkdown system provides an easy
+entry into the world of reproducible research methods. It also provides a good facility
+for students to create homework and reports that include text, \R\ code, \R\ output, and graphics.
+
+To create a new RMarkdown file, select \tab{File}, then \tab{New File}, then \tab{RMarkdown}.
+The file will be opened with a short template document that illustrates the markup language.
+
+\includegraphics[width=4.34in]{markdown1.png}
+
+
+The \pkg{mosaic} package includes two useful RMarkdown templates for getting started: {\tt fancy} includes bells and whistles (and is intended to give an overview of features), while {\tt plain} is useful as a starting point for a new analysis. These
+are accessed using the {\tt Template} option when creating a new RMarkdown file.
+
+\includegraphics[width=4.34in]{markdown2.png}
+
+
+Click on the \tab{Knit} button to convert to an HTML, PDF, or Word file.
+
+\includegraphics[width=4.34in]{markdown3.png}
+
+This will generate a formatted version of the document.
+
+\includegraphics[width=4.34in]{r-markdown.jpg}
+
+There is a button (marked with a question mark) which
+provides a brief description of the supported markup commands. The \RStudio\ web site
+includes more extensive tutorials on using RMarkdown.
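For reference, a minimal RMarkdown document consists of a YAML header followed by Markdown text and \R\ code chunks. The sketch below is illustrative only (the title and chunk contents are made up):

````markdown
---
title: "Homework 1"
output: html_document
---

Narrative text written in **Markdown**.

```{r cesd-summary}
library(mosaic)
favstats(~ cesd, data = HELPrct)
```
````

Knitting such a file runs each chunk in a fresh session and weaves the output into the rendered document.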
+
+\Caution{RMarkdown and \pkg{knitr}/\LaTeX\ files do not have access to the console environment,
+so the code in them must be self-contained.}
+%
+It is important to remember that unlike \R\ scripts, which are executed in the
+console and have access to the console environment, RMarkdown and \pkg{knitr}/\LaTeX\
+files do not have access to the console environment. This is a good feature because it forces
+the files to be self-contained, which makes them transferable and respects good
+reproducible research practices. But beginners, especially if they adopt a
+strategy of trying things out in the console and copying and pasting successful
+code from the console to their file, will often create files that are
+incomplete and therefore do not compile correctly.
+
+
+\section{The Other Panels and Tabs}
+
+\subsection{The History Tab}
+
+As commands are entered in the console, they appear in the \tab{History} tab.
+These histories can be saved and loaded; there is a search feature to locate
+previous commands; and individual lines or sections can be transferred back to
+the console. Keeping the \tab{History} tab open will allow you to go
+back and see the previous several commands. This can be especially useful when
+commands produce a fair amount of output and so scroll off the screen rapidly.
+
+\subsection{Communication between tabs}
+
+\RStudio\ provides several ways to move \R\ code between tabs. Pressing the \tab{Run} button
+in the editing panel for an \R\ script or RMarkdown or other file will copy lines of code
+into the Console and run them.
+\subsection{The Files Tab}
+The \tab{Files} tab provides a simple file manager. It can be navigated in familiar ways
+and used to open, move, rename, and delete files. In the browser version of \RStudio,
+the \tab{Files} tab also provides a file upload utility for moving files from the local
+machine to the server.
+
+In RMarkdown and knitr files one can also run the code in a particular chunk or in all of the
+chunks in a file. Each of these features makes it easy to try out code ``live'' while
+creating a document that keeps a record of the code.
+
+In the reverse direction, code from the history can be copied either back into the console
+to run it again (perhaps after editing) or into one of the file editing tabs for inclusion
+in a file.
+
+
+\subsection{The Help Tab}
+The \tab{Help} tab is where \RStudio\ displays \R\ help files. These can be searched and navigated
+in the \tab{Help} tab. You can also open a help file using the \texttt{?} operator in the console.
+For example, the following command
+will provide the help file for the logarithm function.
+<>=
+?log
+@
+
+\subsection{The Environment Tab}
+The \tab{Environment} tab shows the objects available to the console. These are
+subdivided into data, values (non-dataframe, non-function objects) and
+functions.
+The broom icon can be used to remove all objects from the environment, and it is good
+to do this from time to time, especially when running on an \RStudio\ server or if you
+choose to save the environment when shutting down \RStudio, since in these cases objects
+can stay in the environment essentially indefinitely.
+
+\subsection{The Plots Tab}
+Plots created in the console are displayed in the \tab{Plots} tab. For example,
+the following commands display the number of births in the United States for each day in 1978.
+<>=
+library(mosaic)
+gf_point(births ~ dayofyear, data = Births78)
+@
+From the \tab{Plots} tab, you can navigate to previous plots and also export plots
+in various formats after interactively resizing them.
+
+
+% this fixes bad spacing -- but I don't know why the spacing was bad
+%\bigskip
+\subsection{The Packages Tab}
+
+Much of the functionality of \R\ is located in packages, many of which can be obtained
+from a central clearing house called CRAN (Comprehensive R Archive Network).
The \tab{Packages} +tab facilitates installing and loading packages. It will also allow you to search for +packages that have been updated since you installed them. + + + diff --git a/Compendium/HELP-Study.Rnw b/StudentGuide/HELP-Study.Rnw similarity index 96% rename from Compendium/HELP-Study.Rnw rename to StudentGuide/HELP-Study.Rnw index d20faed..b65de0a 100644 --- a/Compendium/HELP-Study.Rnw +++ b/StudentGuide/HELP-Study.Rnw @@ -37,8 +37,8 @@ The \pkg{mosaicData} package contains several forms of the de-identified HELP da We will focus on \pkg{HELPrct}, which contains 27 variables for the 453 subjects with minimal missing data, primarily at baseline. -Variables included in the HELP dataset are described in Table \ref{tab:helpvars}. More information can be found here\cite{Horton:2011:R}. -A copy of the study instruments can be found at: \url{http://www.amherst.edu/~nhorton/help}. +Variables included in the HELP dataset are described in Table \ref{tab:helpvars}. More information can be found at: \url{https://nhorton.people.amherst.edu/r2}. +A copy of the study instruments can be found at: \url{https://nhorton.people.amherst.edu/help}. \begin{longtable}{|p{2.1cm}|p{6.8cm}|p{3.5cm}|} \caption{Annotated description of variables in the \dataframe{HELPrct} dataset} \label{tab:helpvars} \\ diff --git a/Compendium/Introduction.Rnw b/StudentGuide/Introduction.Rnw similarity index 50% rename from Compendium/Introduction.Rnw rename to StudentGuide/Introduction.Rnw index 7608338..222ab09 100644 --- a/Compendium/Introduction.Rnw +++ b/StudentGuide/Introduction.Rnw @@ -1,41 +1,41 @@ \vspace*{-.5cm} -In this monograph, we briefly review the commands and functions needed to analyze data from introductory and second courses in statistics. This is intended to complement the \emph{Start Teaching with R} and \emph{Start Modeling with R} books. 
+In this reference book, we briefly review the commands and functions needed to analyze data from introductory and second courses in statistics. This is intended to complement the \emph{Start Teaching with R} and \emph{Start Modeling with R} books. Most of our examples will use data from the HELP (Health Evaluation and Linkage to Primary Care) study: a randomized clinical trial of a novel way to link at-risk subjects with primary care. More information on the dataset can be found in chapter \ref{sec:help}. Since the selection and order of topics can vary greatly from textbook to textbook and instructor to instructor, we have chosen to -organize this material by the kind of data being analyzed. This should make it straightforward to find what you are looking for even if you present things in a different order. This is also a good organizational template to give your students to help them keep straight ``what to do when". - -Some data management is needed by students (and more by instructors). This -material is reviewed in Chapter \ref{sec:manipulatingData}. +organize this material by the kind of data being analyzed. This should make it straightforward to find what you are looking for. +Some data management skills are needed by students\cite{hort:2015}. A basic introduction to key idioms +is provided in Chapter \ref{sec:manipulatingData}. \myindex{vignettes}% -This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of \R\ for introductory statistics courses, and the \pkg{mosaicData} package which includes a number of data sets. 
A short summary of the \R\ commands needed to teach introductory statistics can be found in the mosaic package vignette:\\ -\verb+http://cran.r-project.org/web/packages/mosaic/vignettes/mosaic-resources.pdf+ +This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the \pkg{mosaic} package, which was written to simplify the use of \R\ for introductory statistics courses, and the \pkg{mosaicData} package which includes a number of data sets. The \pkg{ggformula} package provides support for high quality graphics using the mosaic modeling language. A paper describing the mosaic approach to teaching statistics and data science can be found at \url{https://journal.r-project.org/archive/2017/RJ-2017-024}. A short summary of the \R\ commands needed to teach introductory statistics can be found in the mosaic package vignette: +\url{https://cran.r-project.org/web/packages/mosaic}. -Other related resources from Project MOSAIC may be helpful, including an annotated set of examples from the sixth edition of Moore, McCabe and Craig's \emph{Introduction to the Practice of Statistics}\cite{moor:mcca:2007} (see \url{http://www.amherst.edu/~nhorton/ips6e}), the second and third editions of the \emph{Statistical Sleuth}\cite{Sleuth2} (see \url{http://www.amherst.edu/~nhorton/sleuth}), and \emph{Statistics: Unlocking the Power of Data} by Lock et al (see \url{https://github.com/rpruim/Lock5withR}). +Other related resources from Project MOSAIC may be helpful, including an annotated set of examples from a number of textbooks (see \url{https://cran.r-project.org/web/packages/mosaic/vignettes/mosaic-resources.html}). \myindex{installing packages}% \Rindex{install.packages()}% -To use a package within R, it must be installed (one time), and loaded (each session). 
The \pkg{mosaic} and \pkg{mosaicData} packages can be installed using the following commands: +To use a package within R, it must be installed (one time), and loaded (each session). The \pkg{mosaic} package can be installed using the following commands: <>= install.packages("mosaic") # note the quotation marks @ -\TeachingTip[-1.5cm]{\Rstudio\ features a simplified package installation tab (in the bottom right panel).} The {\tt \#} character is a comment in R, and all text after that on the current line is ignored. +\FoodForThought[-1.5cm]{\Rstudio\ features a simplified package installation tab (in the bottom right panel).} The {\tt \#} character is a comment in R, and all text after that on the current line is ignored. \myindex{loading packages}% +\Rindex{library()}% \Rindex{require()}% Once the package is installed (one time only), it can be loaded by running the command: <>= -require(mosaic) -require(mosaicData) +library(mosaic) +# require(mosaic) can also be used to load packages @ -\TeachingTip[-3cm]{The \pkg{knitr}/\LaTeX\ system allows experienced users to combine \R\ and \LaTeX\ in the same document. The reward for learning this more complicated system is much finer control over the output format. But RMarkdown is much easier to learn and is adequate even for professional-level work.}% +\FoodForThought[-3cm]{The \pkg{knitr}/\LaTeX\ system allows experienced users to combine \R\ and \LaTeX\ in the same document. The reward for learning this more complicated system is much finer control over the output format. 
But RMarkdown is much easier to learn and is adequate even for professional-level work.}% \myindex{reproducible analysis}% diff --git a/Compendium/Compendium.Rnw b/StudentGuide/MOSAIC-StudentGuide.Rnw similarity index 89% rename from Compendium/Compendium.Rnw rename to StudentGuide/MOSAIC-StudentGuide.Rnw index 0ecf746..b16d07d 100644 --- a/Compendium/Compendium.Rnw +++ b/StudentGuide/MOSAIC-StudentGuide.Rnw @@ -1,7 +1,7 @@ \documentclass{tufte-book} %[openany] option unselected -\usepackage{RBook} +\usepackage{../include/RBook} \usepackage{pdfpages} %\usepackage[shownotes]{authNote} \usepackage[hidenotes]{authNote} @@ -82,9 +82,9 @@ -\title{A Compendium of Commands to Teach Statistics with R} +\title{A Student's Guide to R} \author[Horton, Kaplan, Pruim]{ Nicholas J. Horton, Daniel Kaplan, and Randall Pruim} -\date{January 2015} +\date{June 2018} \begin{document} \def\cplabel{^X} @@ -96,28 +96,26 @@ <>= #setCacheDir("cache") -require(MASS) -require(grDevices) -require(datasets) -require(stats) -require(lattice) -require(grid) +# require(MASS) # commented out by NJH 9/27/2015 +library(grDevices) +library(datasets) +library(stats) +library(lattice) +library(grid) # require(fastR) # commented out by NH on 7/12/2012 -require(mosaic) -require(mosaicData) -trellis.par.set(theme=col.mosaic(bw=FALSE)) -trellis.par.set(fontsize=list(text=9)) +library(mosaic) options(format.R.blank=FALSE) options(width=70) -require(vcd) -require(knitr) +options(continue=" ") +library(knitr) opts_chunk$set( tidy=FALSE, size='small', dev="pdf", fig.path="figures/fig-", - fig.width=3, fig.height=2, + fig.width=4, fig.height=2.4, fig.align="center", fig.show="hold", + prompt=TRUE, comment=NA) knit_theme$set("greyscale0") # For printing in black and white opts_chunk$set( fig.path="figure/Core-fig-" ) @@ -143,10 +141,10 @@ knit_hooks$set(document = function(x) { \newpage \vspace*{2in} -\parbox{4in}{\noindent Copyright (c) 2015 by Randall Pruim, Nicholas Horton, \& Daniel Kaplan.} 
+\parbox{4in}{\noindent Copyright (c) 2018 by Nicholas J. Horton, Randall Pruim, \& Daniel Kaplan.} \medskip -\parbox{4in}{\noindent Edition 1.0, January 2015} +\parbox{4in}{\noindent Edition 1.3, June 2018} \bigskip @@ -163,13 +161,17 @@ knit_hooks$set(document = function(x) { \hspace*{1.3cm}\tableofcontents -<>= +<>= @ \chapter{Introduction} <>= @ +\chapter{Getting Started with RStudio} +<>= +@ + \chapter{One Quantitative Variable} <>= @ @@ -210,7 +212,7 @@ knit_hooks$set(document = function(x) { <>= @ -\chapter{Data Management} +\chapter{Data Wrangling} <>= @ @@ -221,6 +223,8 @@ knit_hooks$set(document = function(x) { \chapter{Exercises and Problems} +The first part of the exercise number indicates which chapter it comes from. + \shipoutProblems diff --git a/StudentGuide/MOSAIC-StudentGuide.pdf b/StudentGuide/MOSAIC-StudentGuide.pdf new file mode 100644 index 0000000..fb1bd77 Binary files /dev/null and b/StudentGuide/MOSAIC-StudentGuide.pdf differ diff --git a/Compendium/MoreThanTwoVars.Rnw b/StudentGuide/MoreThanTwoVars.Rnw similarity index 66% rename from Compendium/MoreThanTwoVars.Rnw rename to StudentGuide/MoreThanTwoVars.Rnw index a52f08e..9608366 100644 --- a/Compendium/MoreThanTwoVars.Rnw +++ b/StudentGuide/MoreThanTwoVars.Rnw @@ -3,26 +3,32 @@ We can fit a two (or more) way ANOVA model, without or with an interaction, using the same modeling syntax. 
+\Rindex{median()}% +\Rindex{gf\_boxplot()}% +\Rindex{factor()}% +\Rindex{mutate()}% +\Rindex{aov()}% <>= -median(cesd ~ substance | sex, data=HELPrct) -bwplot(cesd ~ subgrp | sex, data=HELPrct) +HELPrct <- mutate(HELPrct, subgrp = factor(substance, + levels = c("alcohol", "cocaine", "heroin"), + labels = c("A", "C", "H"))) +median(cesd ~ substance | sex, data = HELPrct) +gf_boxplot(cesd ~ subgrp | sex, data = HELPrct) @ <>= -summary(aov(cesd ~ substance + sex, data=HELPrct)) +summary(aov(cesd ~ substance + sex, data = HELPrct)) @ <>= -summary(aov(cesd ~ substance * sex, data=HELPrct)) +summary(aov(cesd ~ substance * sex, data = HELPrct)) @ There's little evidence for the interaction, though there are statistically significant main effects terms for \variable{substance} group and \variable{sex}. - +\Rindex{plotModel()}% <>= -xyplot(cesd ~ substance, groups=sex, - auto.key=list(columns=2, lines=TRUE, points=FALSE), type='a', - data=HELPrct) +mod <- lm(cesd ~ substance + sex + substance * sex, data = HELPrct) +plotModel(mod) @ -\Rindex{auto.key option} \section{Multiple regression} @@ -41,17 +47,17 @@ regularly. The motivation for this is described at length in the companion volu Here we consider a model (parallel slopes) for depressive symptoms as a function of Mental Component Score (MCS), age (in years) and sex of the subject. -\newpage - +\myindex{msummary()}% <>= -lmnointeract <- lm(cesd ~ mcs + age + sex, data=HELPrct) -summary(lmnointeract) +lmnointeract <- lm(cesd ~ mcs + age + sex, data = HELPrct) +msummary(lmnointeract) @ +\myindex{anova()}% \myindex{interactions}% We can also fit a model that includes an interaction between MCS and sex. <>= -lminteract <- lm(cesd ~ mcs + age + sex + mcs:sex, data=HELPrct) -summary(lminteract) +lminteract <- lm(cesd ~ mcs + age + sex + mcs:sex, data = HELPrct) +msummary(lminteract) anova(lminteract) @ <>= @@ -67,27 +73,30 @@ this from the model. 
\Rindex{plotFun()}% \Rindex{makeFun()}% The \function{makeFun()} and \function{plotFun()} functions from the \pkg{mosaic} package -can be used to display the results from a regression model. For this example, we might -display the predicted CESD values for a range of MCS values a 36 year old male and female subject from the parallel +can be used to display the predicted values from a regression model. For this example, we might +display the predicted CESD values over a range of MCS (mental component score) values for a hypothetical 36 year old male and female subject, based on the parallel slopes (no interaction) model. <>= lmfunction <- makeFun(lmnointeract) @ -\Rindex{xyplot()}% -\Rindex{auto.key option}% +\Rindex{gf\_point()}% +\Rindex{gf\_fun()}% \Rindex{ylab option}% -\Rindex{groups option}% -\Rindex{add option}% -We can now plot this function for male and female subjects over a range of MCS (mental component score) values, along -with the observed data for 36 year olds. +\Rindex{color option}% +\Rindex{linetype option}% +\Rindex{xlim option}% +\Rindex{size option}% +We can now plot the predicted values separately for male and female subjects over a range of MCS (mental component score) values, along +with the observed data for all of the 36 year olds. <>= -xyplot(cesd ~ mcs, groups=sex, auto.key=TRUE, - data=filter(HELPrct, age==36)) -plotFun(lmfunction(mcs, age=36, sex="male") ~ mcs, - xlim=c(0, 60), lwd=2, ylab="predicted CESD", add=TRUE) -plotFun(lmfunction(mcs, age=36, sex="female") ~ mcs, - xlim=c(0, 60), lty=2, lwd=3, add=TRUE) +gf_point(cesd ~ mcs, color = ~ sex, + data = filter(HELPrct, age == 36), + ylab = "predicted CESD") %>% + gf_fun(lmfunction(mcs, age = 36, sex = "male") ~ mcs, + xlim = c(0, 60), size = 1.5) %>% + gf_fun(lmfunction(mcs, age = 36, sex = "female") ~ mcs, + xlim = c(0, 60), linetype = 2, size = 2) @ @@ -99,14 +108,12 @@ confidence intervals). 
\Rindex{mplot()}% <>= -mplot(lmnointeract, rows=-1, which=7) +mplot(lmnointeract, rows = -1, which = 7) @ -\TeachingTip[-4cm]{Darker dots indicate regression coefficients where the 95\% confidence interval does not include the null hypothesis value of zero.} - -\Caution[-2cm\{Be careful when fitting regression models with missing values (see also section \ref{sec:miss}).} +\FoodForThought[-4cm]{Darker dots indicate regression coefficients where the 95\% confidence interval does not include the null hypothesis value of zero.} -\newpage +\Caution{Be careful when fitting regression models with missing values (see also section \ref{sec:miss}).} \subsection{Residual diagnostics} \myindex{residual diagnostics} @@ -114,18 +121,20 @@ mplot(lmnointeract, rows=-1, which=7) It's straightforward to undertake residual diagnostics for this model. We begin by adding the fitted values and residuals to the dataset. -\TeachingTip[-1cm]{The \function{mplot} function can also be used to create these graphs.} +\FoodForThought[-1cm]{The \function{mplot} function can also be used to create these graphs.} \Rindex{resid()}% \Rindex{fitted()}% -\Rindex{abs()}% +\Rindex{gf\_dhistogram()}% +\Rindex{gf\_fitdistr()}% \InstructorNote{Here we are adding two new variables into an existing dataset. It's often a good practice to give the resulting dataframe a new name.} <>= -HELPrct <- mutate(HELPrct, residuals = resid(lmnointeract), +HELPrct <- mutate(HELPrct, + residuals = resid(lmnointeract), pred = fitted(lmnointeract)) @ <>= -histogram(~ residuals, xlab="residuals", fit="normal", - data=HELPrct) +gf_dhistogram(~ residuals, data = HELPrct) %>% + gf_fitdistr(dist = "dnorm") @ We can identify the subset of observations with extremely large residuals. @@ -136,16 +145,18 @@ filter(HELPrct, abs(residuals) > 25) @ \Rindex{cex option}% -\Rindex{type option}% -<>= -xyplot(residuals ~ pred, ylab="residuals", cex=0.3, - xlab="predicted values", main="predicted vs. 
residuals", - type=c("p", "r", "smooth"), data=HELPrct) +<>= +gf_point(residuals ~ pred, cex = .3, xlab = "predicted values", + title = "predicted vs. residuals", data = HELPrct) %>% + gf_smooth(se = FALSE) %>% + gf_hline(yintercept = 0) @ -<>= -xyplot(residuals ~ mcs, xlab="mental component score", - ylab="residuals", cex=0.3, - type=c("p", "r", "smooth"), data=HELPrct) +<>= +gf_point(residuals ~ mcs, cex = .3, + xlab = "mental component score", + title = "MCS vs. residuals", data = HELPrct) %>% + gf_smooth(se = FALSE) %>% + gf_hline(yintercept = 0) @ The assumptions of normality, linearity and homoscedasticity seem reasonable here. @@ -157,9 +168,9 @@ how it changes as a function of temperature and day of the week. Describe the distribution of the variable \variable{avgtemp} in terms of its center, spread and shape. <>= -favstats(~ avgtemp, data=RailTrail) -densityplot(~ avgtemp, xlab="Average daily temp (degrees F)", - data=RailTrail) +favstats(~ avgtemp, data = RailTrail) +gf_dens(~ avgtemp, xlab = "Average daily temp (degrees F)", + data = RailTrail) @ \end{problem} \begin{solution} @@ -173,8 +184,9 @@ center, spread and shape. \end{problem} \begin{solution} <<>>= -favstats(~ cloudcover, data=RailTrail) -densityplot(~ cloudcover, data=RailTrail) +favstats(~ cloudcover, data = RailTrail) +gf_dens(~ cloudcover, data = RailTrail) + + xlim(-5, 15) @ The distribution of cloud cover is ungainly (almost triangular), with increasing probability for more cloudcover. The mean is 5.8 oktas (out of 10), with standard deviation of 3.2 oktas. It tends to be @@ -188,8 +200,9 @@ center, spread and shape. 
\end{problem} \begin{solution} <<>>= -favstats(~ volume, data=RailTrail) -densityplot(~ volume, xlab="# of crossings", data=RailTrail) +favstats(~ volume, data = RailTrail) +gf_dens(~ volume, xlab = "# of crossings", data = RailTrail) + + xlim(0, 900) filter(RailTrail, volume > 700) @ The distribution of daily crossings is approximately normally @@ -205,24 +218,24 @@ What percentage of the days are weekends/holidays? \end{problem} \begin{solution} <<>>= -tally(~ weekday, data=RailTrail) -tally(~ weekday, format="percent", data=RailTrail) +tally(~ weekday, data = RailTrail) +tally(~ weekday, format = "percent", data = RailTrail) @ Just over 30\% of the days are weekends or holidays. \end{solution} \begin{problem} Use side-by-side boxplots to compare the distribution of \variable{volume} by day type in the \dataframe{RailTrail} dataset. -Hint: you'll need to turn the numeric \variable{weekday} variable into a factor variable using \function{as.factor()}. +Hint: you'll need to turn the numeric \variable{weekday} variable into a factor variable using \function{as.factor()}. What do you conclude? \end{problem} \begin{solution} <<>>= -bwplot(volume ~ as.factor(weekday), data=RailTrail) +gf_boxplot(volume ~ as.factor(weekday), data = RailTrail) @ or <<>>= -RailTrail <- mutate(RailTrail, daytype = ifelse(weekday==1, "weekday", "weekend/holiday")) -bwplot(volume ~ daytype, data=RailTrail) +RailTrail <- mutate(RailTrail, daytype = ifelse(weekday == 1, "weekday", "weekend/holiday")) +gf_boxplot(volume ~ daytype, data = RailTrail) @ We see that the weekend/holidays tend to have more users. \end{solution} @@ -234,7 +247,7 @@ What do you conclude? \end{problem} \begin{solution} <<>>= -densityplot(volume ~ weekday, auto.key=TRUE, data=RailTrail) +gf_density(~ volume, color = ~ as.factor(weekday), fill = ~ as.factor(weekday), data = RailTrail) @ We see that the weekend/holidays tend to have more users. 
\end{solution} @@ -244,8 +257,10 @@ smoother (lowess curve). What do you observe about the relationship? \end{problem} \begin{solution} <<>>= -xyplot(volume ~ avgtemp, xlab="average temperature (degrees F)", - type=c("p", "r", "smooth"), lwd=2, data=RailTrail) +mod2 <- lm(volume ~ avgtemp, data = RailTrail) +gf_point(volume ~ avgtemp, xlab = "average temperature (degrees F)", data = RailTrail) %>% + gf_smooth(size = 2, se = FALSE) %>% + gf_fun(makeFun(mod2)(avgtemp) ~ avgtemp, size = 2) @ We see that there is a positive relationship between these two variables, but the association is somewhat nonlinear (which makes sense as we wouldn't continue to predict an increase in usage when the @@ -260,7 +275,7 @@ Is there evidence to retain the interaction term at the $\alpha=0.05$ level? \end{problem} \begin{solution} <<>>= -fm <- lm(volume ~ cloudcover + avgtemp + weekday + avgtemp:weekday, data=RailTrail) +fm <- lm(volume ~ cloudcover + avgtemp + weekday + avgtemp:weekday, data = RailTrail) summary(fm) @ The interaction between average temperature and day-type is statistically significant (p=0.016). We @@ -278,21 +293,20 @@ coef(fm) \begin{solution} <<>>= myfun <- makeFun(fm) -myfun(cloudcover=0, avgtemp=60, weekday=1) +myfun(cloudcover = 0, avgtemp = 60, weekday = 1) @ We expect just over 480 crossings on a day with these characteristics. \end{solution} \begin{problem} -Use \function{makeFun()} and \function{plotFun()} to display predicted values for the number of crossings +Use \function{makeFun()} and \function{gf\_fun()} to display predicted values for the number of crossings on weekdays and weekends/holidays for average temperatures between 30 and 80 degrees and a cloudy day (\variable{cloudcover=10}). 
\end{problem} \begin{solution} <<>>= -myfun <- makeFun(fm) -xyplot(volume ~ avgtemp, data=RailTrail) -plotFun(myfun(cloudcover=10, avgtemp, weekday=0) ~ avgtemp, lwd=2, add=TRUE) -plotFun(myfun(cloudcover=10, avgtemp, weekday=1) ~ avgtemp, lty=2, lwd=3, add=TRUE) +gf_point(volume ~ avgtemp, data = RailTrail) %>% + gf_fun(myfun(cloudcover=10, avgtemp, weekday=0) ~ avgtemp, size = 1.5) %>% + gf_fun(myfun(cloudcover=10, avgtemp, weekday=1) ~ avgtemp, linetype = 2, size = 2) @ We interpret this as being a steeper slope (stronger association) on weekdays rather than weekends. @@ -304,7 +318,8 @@ density) to assess the normality of the residuals. \end{problem} \begin{solution} <<>>= -histogram(~ resid(fm), fit="normal") +gf_dhistogram(~resid(fm), bins = 7) %>% + gf_fitdistr(dist = "dnorm") @ The distribution is approximately normal. \end{solution} @@ -314,7 +329,9 @@ on the linearity of the model and assumption of equal variance. \end{problem} \begin{solution} <<>>= -xyplot(resid(fm) ~ fitted(fm), type=c("p", "r", "smooth")) +gf_point(resid(fm) ~ fitted(fm)) %>% + gf_smooth(se = FALSE) %>% + gf_hline(yintercept = 0) @ The association is fairly linear, except in the tails. There's some evidence that the variability of the residuals increases with larger fitted values. \end{solution} @@ -323,7 +340,9 @@ Using the same model generate a scatterplot of the residuals versus average temp \end{problem} \begin{solution} <<>>= -xyplot(resid(fm) ~ avgtemp, type=c("p", "r", "smooth"), data=RailTrail) +gf_point(resid(fm) ~ avgtemp, data = RailTrail) %>% + gf_smooth(se = FALSE) %>% + gf_hline(yintercept = 0) @ The association is somewhat non-linear. There's some evidence that the variability of the residuals increases with larger fitted values. 
\end{solution} diff --git a/Compendium/OneCategorical.Rnw b/StudentGuide/OneCategorical.Rnw similarity index 65% rename from Compendium/OneCategorical.Rnw rename to StudentGuide/OneCategorical.Rnw index 8b28756..827916b 100644 --- a/Compendium/OneCategorical.Rnw +++ b/StudentGuide/OneCategorical.Rnw @@ -12,10 +12,10 @@ counts, percentages and proportions for a categorical variable. \Rindex{tally()}% \Rindex{margins option}% <>= -tally( ~ homeless, data=HELPrct) -tally( ~ homeless, margins=TRUE, data=HELPrct) -tally( ~ homeless, format="percent", data=HELPrct) -tally( ~ homeless, format="proportion", data=HELPrct) +tally(~ homeless, data = HELPrct) +tally(~ homeless, margins = TRUE, data = HELPrct) +tally(~ homeless, format = "percent", data = HELPrct) +tally(~ homeless, format = "proportion", data = HELPrct) @ \section{The binomial test} @@ -32,7 +32,7 @@ binom.test(209, 209 + 244) The \pkg{mosaic} package provides a formula interface that avoids the need to pre-tally the data. <>= -result <- binom.test( ~ (homeless=="homeless"), HELPrct) +result <- binom.test(~ (homeless == "homeless"), data = HELPrct) result @ @@ -68,22 +68,23 @@ A similar interval and test can be calculated using the function \function{prop. Here is a count of the number of people at each of the two levels of \variable{homeless} <>= -tally( ~ homeless, data=HELPrct) +tally(~ homeless, data = HELPrct) @ The \function{prop.test} function will carry out the calculations of the proportion test and report the result. -\hfill\newpage +\hfill <>= -prop.test( ~ (homeless=="homeless"), correct=FALSE, data=HELPrct) +prop.test(~ (homeless == "homeless"), correct = FALSE, + data = HELPrct) @ -In this statement, prop.test is examing the \variable{homeless} variable in the same way that \function{tally} would. \Pointer{We write \code{homeless=="homeless"} to define unambiguously which proportion we are considering. We could also have written \code{homeless=="housed"}. 
} -\function{prop.test} can also work directly with numerical counts, the way \function{binom.test()} does. +In this statement, prop.test is examining the \variable{homeless} variable in the same way that \function{tally} would. \Pointer{We write \code{homeless == "homeless"} to define unambiguously which proportion we are considering. We could also have written \code{homeless == "housed"}. } +The \function{prop.test} function can also work directly with numerical counts, the way \function{binom.test()} does. \InstructorNote{\function{prop.test()} calculates a Chi-squared statistic. Most introductory texts use a $z$-statistic. They are mathematically equivalent in terms of inferential statements, but you may need to address the discrepancy with your students.}% <<>>= -prop.test(209, 209 + 244, correct=FALSE) +prop.test(209, 209 + 244, correct = FALSE) @ \section{Goodness of fit tests} @@ -92,38 +93,49 @@ A variety of goodness of fit tests can be calculated against a reference distri <>= -tally( ~ substance, format="percent", data=HELPrct) -observed <- tally( ~ substance, data=HELPrct) +tally(~ substance, format = "percent", data = HELPrct) +observed <- tally(~ substance, data = HELPrct) observed @ -\Caution[-1cm]{In addition to the \option{format} option, there is an option \option{margins} to include marginal totals in the table. The default in \function{tally} is \option{margins=FALSE}. Try it out!} +\Caution[-1cm]{In addition to the \option{format} option, there is an option \option{margins} to include marginal totals in the table. The default in \function{tally} is \option{margins = FALSE}. Try it out!} \Rindex{chisq.test()}% <>= p <- c(1/3, 1/3, 1/3) # equivalent to rep(1/3, 3) -chisq.test(observed, p=p) -total <- sum(observed); total -expected <- total*p; expected +chisq.test(observed, p = p) +total <- sum(observed) +total +expected <- total*p +expected @ We can also calculate the $\chi^2$ statistic manually, as a function of observed and expected values. 
-\TeachingTip[-1cm]{We don't have students do much if any manual calculations in our courses.}% \Rindex{sum()}% \Rindex{pchisq()}% <>= -chisq <- sum((observed - expected)^2/(expected)); chisq -1 - pchisq(chisq, df=2) +chisq <- sum((observed - expected)^2/(expected)) +chisq +1 - pchisq(chisq, df = 2) @ -\TeachingTip[-2cm]{The \function{pchisq} function calculates the probability that a $\chi^2$ random variable with \function{df} degrees is freedom is less than or equal to a given value. Here we calculate the complement to find the area to the right of the observed Chi-square statistic.}% +\FoodForThought[-2cm]{The \function{pchisq} function calculates the probability that a $\chi^2$ random variable with \option{df} degrees of freedom is less than or equal to a given value. Here we calculate the complement to find the area to the right of the observed Chi-square statistic.}% + +It may be helpful to consult a graph of the statistic, where the shaded area represents the probability to the right of the observed value. + +\Rindex{gf\_dist()}% +<>= +gf_dist("chisq", df = 2, fill = ~ (x > 9.31), geom = "area") +@ + Alternatively, the \pkg{mosaic} package provides a version of \function{chisq.test()} with more verbose output. +\Rindex{xchisq.test()}% <<>>= -xchisq.test(observed, p=p) +xchisq.test(observed, p = p) @ \FoodForThought[-1.5cm]{\code{x} in \function{xchisq.test} stands for eXtra.} -\TeachingTip{Objects in the workspace are listed in the {\sc Environment} tab in \RStudio. If you want to clean up that listing, remove objects that are no longer needed with \function{rm}.} +\FoodForThought{Objects in the workspace are listed in the {\sc Environment} tab in \RStudio. 
If you want to clean up that listing, remove objects that are no longer needed with \function{rm}.} <<>>= # clean up variables no longer needed rm(observed, p, total, chisq) diff --git a/Compendium/OneQuantitative.Rnw b/StudentGuide/OneQuantitative.Rnw similarity index 68% rename from Compendium/OneQuantitative.Rnw rename to StudentGuide/OneQuantitative.Rnw index c780b4c..bbad2a9 100644 --- a/Compendium/OneQuantitative.Rnw +++ b/StudentGuide/OneQuantitative.Rnw @@ -14,13 +14,12 @@ level (see \function{?options} for more configuration possibilities). \myindex{HELPrct dataset}% \Rindex{options()}% -\Rindex{require()}% +\Rindex{library()}% \Rindex{mosaic package}% <>= -require(mosaic) -require(mosaicData) -options(digits=3) -mean( ~ cesd, data=HELPrct) +library(mosaic) +options(digits = 4) +mean(~ cesd, data = HELPrct) @ \myindex{Start Teaching with R@\emph{Start Teaching with R}}% @@ -41,11 +40,11 @@ mean(HELPrct$cesd) \Rindex{var()}% Similar functionality exists for other summary statistics. <>= -sd( ~ cesd, data=HELPrct) +sd(~ cesd, data = HELPrct) @ <>= -sd( ~ cesd, data=HELPrct)^2 -var( ~ cesd, data=HELPrct) +sd(~ cesd, data = HELPrct)^2 +var(~ cesd, data = HELPrct) @ It is also straightforward to calculate quantiles of the distribution. @@ -53,7 +52,7 @@ It is also straightforward to calculate quantiles of the distribution. \myindex{quantiles}% \Rindex{median()}% <>= -median( ~ cesd, data=HELPrct) +median(~ cesd, data = HELPrct) @ By default, the \function{quantile()} function displays the quartiles, but can be given a vector of quantiles to display. @@ -65,22 +64,30 @@ with(HELPrct, quantile(cesd)) with(HELPrct, quantile(cesd, c(.025, .975))) @ -\hfill\newpage +\hfill \Rindex{favstats()}% Finally, the \function{favstats()} function in the \pkg{mosaic} package provides a concise summary of many useful statistics. 
<<>>= -favstats( ~ cesd, data=HELPrct) +favstats(~ cesd, data = HELPrct) @ \section{Graphical summaries} -The \function{histogram()} function is used to create a histogram. Here we use the formula interface (as discussed in the \emph{Start Modeling with R} book) to specify that we want a histogram of the CESD scores. +The \function{gf\_histogram()} function is used to create a histogram. Here we use the formula interface (as discussed in the \emph{Start Modeling with R} book) to specify that we want a histogram of the CESD scores. -\Rindex{histogram()}% +\Rindex{gf\_histogram()}% \vspace{-4mm} \begin{center} <>= -histogram( ~ cesd, data=HELPrct) +gf_histogram(~ cesd, data = HELPrct, binwidth = 5.9) +@ +\end{center} + +We can use the \option{binwidth} and \option{center} options to control the width and location of the bins. + +\begin{center} +<>= +gf_histogram(~ cesd, data = HELPrct, binwidth = 5, center = 2.5) @ \end{center} @@ -89,8 +96,8 @@ histogram( ~ cesd, data=HELPrct) \Rindex{format option}% In the \variable{HELPrct} dataset, approximately one quarter of the subjects are female. <<>>= -tally( ~ sex, data=HELPrct) -tally( ~ sex, format="percent", data=HELPrct) +tally(~ sex, data = HELPrct) +tally(~ sex, format = "percent", data = HELPrct) @ It is straightforward to restrict our attention to just the female subjects. @@ -104,32 +111,37 @@ the \function{stem()} function is used to create a stem and leaf plot. 
\Rindex{filter()}% \Rindex{dplyr package}% <>= -female <- filter(HELPrct, sex=='female') -male <- filter(HELPrct, sex=='male') -with(female, stem(cesd)) +Female <- filter(HELPrct, sex == 'female') +Male <- filter(HELPrct, sex == 'male') +with(Female, stem(cesd)) @ \Rindex{dplyr package}% \Rindex{tidyr package}% -Subsets can also be generated and used ``on the fly" (this time including +Subsets can also be generated and used ``on the fly'' (this time including an overlaid normal density): -\Rindex{fit option}% +\Rindex{gf\_fitdistr()}% +\Rindex{gf\_dhistogram()}% +\Rindex{dist option}% <>= -histogram( ~ cesd, fit="normal", - data=filter(HELPrct, sex=='female')) +gf_dhistogram(~ cesd, data = filter(HELPrct, sex == "female"), + binwidth = 7.1) %>% + gf_fitdistr(dist = "dnorm") @ Alternatively, we can make side-by-side plots to compare multiple subsets. +\Rindex{gf\_facet\_wrap()}% <>= -histogram( ~ cesd | sex, data=HELPrct) +gf_dhistogram(~ cesd, data = HELPrct, binwidth = 5.9) %>% + gf_facet_wrap(~ sex) @ The layout can be rearranged. -\Rindex{layout option}% \begin{center} <>= -histogram( ~ cesd | sex, layout=c(1, 2), data=HELPrct) +gf_dhistogram(~ cesd, data = HELPrct, binwidth = 5.9) %>% + gf_facet_wrap(~ sex, nrow = 2) @ \end{center} \begin{problem} @@ -139,33 +151,36 @@ group, just for the male subjects, with an overlaid normal density. \end{problem}% \begin{solution} <>= -histogram( ~ cesd | substance, fit="normal", - data=filter(HELPrct, sex=='male')) +gf_dhistogram(~ cesd | substance, binwidth = 5, + data = filter(HELPrct, sex == "male")) %>% + gf_fitdistr(dist = "dnorm") @ \end{solution}% We can control the number of bins in a number of ways. These can be specified as the total number. -\Rindex{nint option}% +\Rindex{bins option}% \begin{center} <>= -histogram( ~ cesd, nint=20, data=female) +gf_dhistogram(~ cesd, bins = 20, data = Female) @ \end{center} The width of the bins can be specified. 
-\Rindex{width option}% +\Rindex{binwidth option}% \begin{center} <>= -histogram( ~ cesd, width=1, data=female) +gf_dhistogram(~ cesd, binwidth = 2, data = Female) @ \end{center} -The \function{dotPlot()} function is used to create a dotplot +The \function{gf\_dotplot()} function is used to create a dotplot for a smaller subset of subjects (homeless females). We also demonstrate how to change the x-axis label. -\Rindex{dotPlot()}% +\Rindex{gf\_dotplot()}% +\Rindex{gf\_labs()}% <>= -dotPlot( ~ cesd, xlab="CESD score", - data=filter(HELPrct, (sex=="female") & (homeless=="homeless"))) +gf_dotplot(~ cesd, binwidth = 3, + data = filter(HELPrct, sex == "female", homeless == "homeless")) %>% + gf_labs(x = "CESD score") @ @@ -173,24 +188,25 @@ dotPlot( ~ cesd, xlab="CESD score", \FoodForThought{Density plots are also sensitive to certain choices. If your density plot is too jagged or too smooth, try changing the \option{adjust} argument: larger than 1 for smoother plots, less than 1 for more jagged plots.} One disadvantage of histograms is that they can be sensitive to the choice of the number of bins. Another display to consider is a density curve. Here we adorn a density plot with some gratuitous additions to demonstrate how to build up a graphic for pedagogical purposes. We add some text, a superimposed normal density as well as a vertical line. A variety of line types and colors can be specified, as well as line widths. +Here we adorn a density plot with some additions to demonstrate how to build up a graphic for pedagogical purposes. We add some text, a superimposed normal density as well as a vertical line. A variety of line types and colors can be specified, as well as line widths. 
\DiggingDeeper{The \function{plotFun()} function can also be used to annotate plots (see section \ref{sec:plotFun}).} \begin{center} -\Rindex{densityplot()}% -\Rindex{ladd()}% -\Rindex{panel.mathdensity()}% -\Rindex{panel.abline()}% -\Rindex{col option}% -\Rindex{grid.text()}% -\Rindex{lty option}% -\Rindex{lwd option}% +\Rindex{gf\_dens()}% +\Rindex{gf\_refine()}% +\Rindex{gf\_vline()}% +\Rindex{annotate()}% +\Rindex{gf\_fitdistr()}% +\Rindex{geom option}% +\Rindex{dist option}% +\Rindex{xintercept option}% <>= -densityplot( ~ cesd, data=female) -ladd(grid.text(x=0.2, y=0.8, 'only females')) -ladd(panel.mathdensity(args=list(mean=mean(cesd), - sd=sd(cesd)), col="red"), data=female) -ladd(panel.abline(v=60, lty=2, lwd=2, col="grey")) +gf_dens(~ cesd, data = Female) %>% + gf_refine(annotate(geom = "text", x = 10, y = .025, + label = "only females")) %>% + gf_fitdistr(dist = "dnorm") %>% + gf_vline(xintercept = 60) + + xlim(0, 80) @ \end{center} @@ -198,10 +214,10 @@ ladd(panel.abline(v=60, lty=2, lwd=2, col="grey")) \myindex{polygons}% A third option is a frequency polygon, where the graph is created by joining the midpoints of the top of the bars of a histogram. -\Rindex{freqpolygon()}% +\Rindex{gf\_freqpoly()}% \begin{center} <>= -freqpolygon( ~ cesd, data=female) + gf_freqpoly(~ cesd, data = Female, binwidth = 3.8) @ \end{center} @@ -211,7 +227,7 @@ freqpolygon( ~ cesd, data=female) The most famous density curve is a normal distribution. The \function{xpnorm()} function displays the probability that a random variable is less than the first argument, for a normal distribution with mean given by the second argument and standard deviation by the third. More information about probability distributions can be found in section \ref{sec:probability}. 
<>= -xpnorm(1.96, mean=0, sd=1) +xpnorm(1.96, mean = 0, sd = 1) @ @@ -224,8 +240,8 @@ xpnorm(1.96, mean=0, sd=1) We can calculate a 95\% confidence interval for the mean CESD score for females by using a t-test: <>= -t.test( ~ cesd, data=female) -confint(t.test( ~ cesd, data=female)) +t.test(~ cesd, data = Female) +confint(t.test(~ cesd, data = Female)) @ \DiggingDeeper{More details and examples can be found in the @@ -235,23 +251,23 @@ confint(t.test( ~ cesd, data=female)) But it's also straightforward to calculate this using a bootstrap. The statistic that we want to resample is the mean. <>= -mean( ~ cesd, data=female) +mean(~ cesd, data = Female) @ One resampling trial can be carried out: -\TeachingTip{Here we sample with replacement from the original dataframe, +\FoodForThought{Here we sample with replacement from the original dataframe, creating a resampled dataframe with the same number of rows.} \Rindex{resample()}% <>= -mean( ~ cesd, data=resample(female)) +mean(~ cesd, data = resample(Female)) @ -\TeachingTip{Even though a single trial is of little use, it's smart having +\FoodForThought{Even though a single trial is of little use, it's smart having students do the calculation to show that they are (usually!) 
getting a different result than without resampling.} Another will yield different results: <<>>= -mean( ~ cesd, data=resample(female)) +mean(~ cesd, data = resample(Female)) @ Now conduct 1000 resampling trials, saving the results in an object @@ -259,6 +275,7 @@ called \texttt{trials}: \Rindex{do()}% \Rindex{qdata()}% <>= -trials <- do(1000) * mean( ~ cesd, data=resample(female)) -qdata(c(.025, .975), ~ result, data=trials) +trials <- do(1000) * mean(~ cesd, data = resample(Female)) +head(trials, 3) +qdata(~ mean, c(.025, .975), data = trials) @ diff --git a/Compendium/Power.Rnw b/StudentGuide/Power.Rnw similarity index 80% rename from Compendium/Power.Rnw rename to StudentGuide/Power.Rnw index 5bc7e67..2a615c0 100644 --- a/Compendium/Power.Rnw +++ b/StudentGuide/Power.Rnw @@ -19,9 +19,9 @@ a positive value. Let's consider values between 15 and 19. <>= xvals <- 15:19 -probs <- 1 - pbinom(xvals, size=25, prob=0.5) +probs <- 1 - pbinom(xvals, size = 25, prob = 0.5) cbind(xvals, probs) -qbinom(.95, size=25, prob=0.5) +qbinom(.95, size = 25, prob = 0.5) @ So we see that if we decide to reject when the number of positive values is 17 or larger, we will have an $\alpha$ level of \Sexpr{round(1-pbinom(16, 25, 0.5), 3)}, @@ -29,12 +29,12 @@ which is near the nominal value in the problem. We calculate the power of the sign test as follows. The probability that $X_i > 0$, given that $H_A$ is true is given by: <>= -1 - pnorm(0, mean=0.3, sd=1) +1 - pnorm(0, mean = 0.3, sd = 1) @ We can view this graphically using the command: \begin{center} <>= -xpnorm(0, mean=0.3, sd=1, lower.tail=FALSE) +xpnorm(0, mean = 0.3, sd = 1, lower.tail = FALSE) @ \end{center} The power under the alternative is equal to the probability of getting 17 or more positive values, @@ -42,7 +42,7 @@ given that $p=0.6179$: \Rindex{pbinom()}% <>= -1 - pbinom(16, size=25, prob=0.6179) +1 - pbinom(16, size = 25, prob = 0.6179) @ The power is modest at best. 
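The binomial power calculation above can also be checked empirically. A minimal base R simulation (the number of trials and the seed are arbitrary choices, not from the text):

```r
set.seed(42)
# One trial: draw 25 observations from N(0.3, 1) under H_A and
# reject H0 when 17 or more of them are positive
reject <- replicate(10000, sum(rnorm(25, mean = 0.3, sd = 1) > 0) >= 17)
mean(reject)  # empirical power; compare with 1 - pbinom(16, 25, 0.6179)
```

With 10,000 trials the simulated power should agree with the analytic value to within about a percentage point.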
@@ -52,16 +52,20 @@ We next calculate the power of the test based on normal theory. To keep the com fair, we will set our $\alpha$ level equal to 0.05388. <>= -alpha <- 1-pbinom(16, size=25, prob=0.5); alpha +alpha <- 1 - pbinom(16, size = 25, prob = 0.5) +alpha @ First we find the rejection region. +\Rindex{qnorm()}% +\Rindex{xqnorm()}% <>= -n <- 25; sigma <- 1 # given +n <- 25 +sigma <- 1 # given stderr <- sigma/sqrt(n) -zstar <- qnorm(1-alpha, mean=0, sd=1) +zstar <- xqnorm(1 - alpha, mean = 0, sd = 1) zstar -crit <- zstar*stderr +crit <- zstar * stderr crit @ @@ -74,23 +78,25 @@ under the alternative hypothesis to the right of this cutoff. <<>>= -power <- 1 - pnorm(crit, mean=0.3, sd=stderr) +power <- 1 - pnorm(crit, mean = 0.3, sd = stderr) power @ The power of the test based on normal theory is \Sexpr{round(power,3)}. To provide a check (or for future calculations of this sort) we can use the \function{power.t.test()} function. +\Rindex{power.t.test()}% <<>>= -power.t.test(n=25, delta=.3, sd=1, sig.level=alpha, alternative="one.sided", -type="one.sample")$power +power.t.test(n = 25, delta = .3, sd = 1, sig.level = alpha, + alternative = "one.sided", +type = "one.sample")$power @ This analytic (formula-based approach) yields a similar estimate to the value that we calculated directly. Overall, we see that the t-test has higher power than the sign test, if the underlying -data are truly normal. \TeachingTip{It's useful to have students calculate power empirically, -to demonstrate the power of simulations.} +data are truly normal. 
\FoodForThought{Calculating power empirically +demonstrates the power of simulations.} \begin{problem} \label{prob:power1}% Find the power of a two-sided two-sample t-test where both distributions @@ -106,7 +112,7 @@ alpha <- 0.01 <>= n alpha -power.t.test(n=n, delta=.5, sd=1, sig.level=alpha) +power.t.test(n = n, delta = .5, sd = 1, sig.level = alpha) @ \end{solution} \begin{problem} @@ -117,7 +123,7 @@ difference between means is 25\% of the standard deviation in the groups \end{problem} \begin{solution} <>= -power.t.test(delta=.25, sd=1, sig.level=alpha, power=0.90) +power.t.test(delta = .25, sd = 1, sig.level = alpha, power = 0.90) @ \end{solution} diff --git a/Compendium/ProbabilityDistributions.Rnw b/StudentGuide/ProbabilityDistributions.Rnw similarity index 57% rename from Compendium/ProbabilityDistributions.Rnw rename to StudentGuide/ProbabilityDistributions.Rnw index c3f7d8d..6b38f43 100644 --- a/Compendium/ProbabilityDistributions.Rnw +++ b/StudentGuide/ProbabilityDistributions.Rnw @@ -9,7 +9,7 @@ It is straightforward to generate random samples from these distributions, which can be used for simulation and exploration. <>= -xpnorm(1.96, mean=0, sd=1) # P(Z < 1.96) +xpnorm(1.96, mean = 0, sd = 1) # P(Z < 1.96) @ \Rindex{qnorm()}% \Rindex{dnorm()}% @@ -19,9 +19,16 @@ xpnorm(1.96, mean=0, sd=1) # P(Z < 1.96) \Rindex{integrate()}% <<>>= # value which satisfies P(Z < z) = 0.975 -qnorm(.975, mean=0, sd=1) +qnorm(.975, mean = 0, sd = 1) integrate(dnorm, -Inf, 0) # P(Z < 0) @ + +A similar display is available for the F distribution. + +<<>>= +xpf(3, df1 = 4, df2 = 20) +@ + The following table displays the basenames for probability distributions available within base \R. 
These functions can be prefixed by {\tt d} to find the density function for the distribution, {\tt p} to find the @@ -52,17 +59,56 @@ Uniform & {\tt unif} \\ Weibull & {\tt weibull} \\ \hline \end{tabular} \end{center} -\DiggingDeeper{The \function{fitdistr()} within the \pkg{MASS} package facilitates estimation +\DiggingDeeper{The \function{gf\_fitdistr()} function facilitates estimation of parameters for many distributions.} -The \function{plotDist()} can be used to display distributions in a variety of ways. +The \function{gf\_dist()} function can be used to display distributions in a variety of ways. +\Rindex{gf\_dist()}% <>= -plotDist('norm', mean=100, sd=10, kind='cdf') +gf_dist('norm', mean = 100, sd = 10, kind = 'cdf') @ <>= -plotDist('exp', kind='histogram', xlab="x") +gf_dist('exp', kind = 'histogram', xlab = "x") @ -<>= -plotDist('binom', size=25, prob=0.25, xlim=c(-1,26)) +Note that this sets the rate parameter to 1 by default and is equivalent to the following command. +<>= +gf_dist('exp', rate = 1, kind = 'histogram', xlab = "x") +@ + +<>= +gf_dist('binom', size = 25, prob = 0.25, xlim = c(-1, 26)) +@ + +Multiple distributions can be plotted on the same plot. +\Rindex{fill option}% +\Rindex{cut()}% +\Rindex{gf\_labs()}% +\Rindex{geom option}% +\Rindex{pch option}% +<>= +gf_dist("norm", mean = 50 * .3, sd = sqrt(50 * .3 * .7), + fill = ~ cut(x, c(-Inf, 15 - 3, 15 + 3, Inf)), geom = "area") %>% + gf_dist("binom", size = 50, prob = .3, col = "black", + pch = 16) %>% + gf_labs(fill = "Intervals") +@ + +The \function{gf\_fun()} function can be used to plot an arbitrary +function (in this case the density of an exponential random variable). 
+ +\Rindex{makeFun()}% +\Rindex{rexp()}% +\Rindex{gf\_histogram()}% +\Rindex{binwidth option}% +\Rindex{center option}% +\Rindex{gf\_fun()}% +\Rindex{color option}% +\Rindex{size option}% +\Rindex{xlim option}% +<>= +f <- makeFun(2 * exp(-2 * x) ~ x) # exponential with rate parameter 2 +x <- rexp(1000, rate = 2) +gf_dhistogram(~ x, binwidth = 0.2, center = 0.1) %>% + gf_fun(f(x) ~ x, color = "red", size = 2, xlim = c(0, 3)) @ \begin{problem} Generate a sample of 1000 exponential random variables with rate parameter @@ -70,7 +116,7 @@ equal to 2, and calculate the mean of those variables. \end{problem} \begin{solution} <>= -x <- rexp(1000, rate=2) +x <- rexp(1000, rate = 2) mean(x) @ \end{solution} @@ -81,6 +127,6 @@ with rate parameter 10. \end{problem} \begin{solution} <>= -qexp(.5, rate=10) +qexp(.5, rate = 10) @ \end{solution} diff --git a/Compendium/QuantitativeResponse.Rnw b/StudentGuide/QuantitativeResponse.Rnw similarity index 62% rename from Compendium/QuantitativeResponse.Rnw rename to StudentGuide/QuantitativeResponse.Rnw index eb7e666..35394ee 100644 --- a/Compendium/QuantitativeResponse.Rnw +++ b/StudentGuide/QuantitativeResponse.Rnw @@ -1,74 +1,87 @@ \section{A dichotomous predictor: numerical and graphical summaries} Here we will compare the distributions of CESD scores by sex. - The \function{mean()} function can be used to calculate the mean CESD score separately for males and females. +\Rindex{mean()}% <>= -mean(cesd ~ sex, data=HELPrct) +mean(cesd ~ sex, data = HELPrct) @ - +\Rindex{favstats()}% The \function{favstats()} function can provide more statistics by group. + <>= -favstats(cesd ~ sex, data=HELPrct) +favstats(cesd ~ sex, data = HELPrct) @ Boxplots are a particularly helpful graphical display to compare distributions. -The \function{bwplot()} function can be used to display the boxplots for the +The \function{gf\_boxplot()} function can be used to display the boxplots for the CESD scores separately by sex. 
We see from both the numerical and graphical summaries that women tend to have slightly higher CESD scores than men. \FoodForThought[-3cm]{Although we usually put explanatory variables along the horizontal axis, page layout sometimes makes the other orientation preferable for these plots.} %\vspace{-8mm} +\Rindex{gf\_boxplot()}% +\Rindex{gf\_refine()}% \begin{center} <>= -bwplot(sex ~ cesd, data=HELPrct) +gf_boxplot(cesd ~ sex, data = HELPrct) %>% +gf_refine(coord_flip()) @ \end{center} When sample sizes are small, there is no reason to summarize with a boxplot -since \function{xyplot()} can handle categorical predictors. +since \function{gf\_point()} can handle categorical predictors. Even with 10--20 observations in a group, a scatter plot is often quite readable. Setting the alpha level helps detect multiple observations with the same value. \FoodForThought{One of us once saw a biologist proudly present side-by-side boxplots. Thinking a major victory had been won, he naively asked how many observations were in each group. ``Four,'' replied the biologist.} +\Rindex{gf\_point()}% +\Rindex{alpha option}% +\Rindex{cex option}% \begin{center} <>= -xyplot(sex ~ length, KidsFeet, alpha=.6, cex=1.4) +gf_point(sex ~ length, alpha = .6, cex = 1.4, + data = KidsFeet) @ \end{center} \section{A dichotomous predictor: two-sample t} The Student's two sample t-test can be run without (default) or with an equal variance assumption. +\Rindex{t.test()}% +\Rindex{var.equal option}% <>= -t.test(cesd ~ sex, var.equal=FALSE, data=HELPrct) +t.test(cesd ~ sex, var.equal = FALSE, data = HELPrct) @ We see that there is a statistically significant difference between the two groups. We can repeat using the equal variance assumption. <>= -t.test(cesd ~ sex, var.equal=TRUE, data=HELPrct) +t.test(cesd ~ sex, var.equal = TRUE, data = HELPrct) @ -The groups can also be compared using the \function{lm()} function (also with an equal variance assumption). 
+The groups can also be compared using the \function{lm()} function (also with an equal variance assumption). The mosaic command \function{msummary()} provides a slightly terser version of the typical output from \function{summary()}. +\Rindex{msummary()}% +\Rindex{summary()}% <<>>= -summary(lm(cesd ~ sex, data=HELPrct)) +msummary(lm(cesd ~ sex, data = HELPrct)) @ -\TeachingTip[1cm]{The \function{lm} function is part of a much more flexible modeling framework while \function{t.test} is essentially a dead end. \function{lm} uses of the equal variance assumption. See the companion book, {\em Start Modeling in R} for more details.}% +\FoodForThought[1cm]{The \function{lm} function is part of a much more flexible modeling framework while \function{t.test} is essentially a dead end. \function{lm} uses the equal variance assumption. See the companion book, {\em Start Modeling with R}, for more details.}% \section{Non-parametric 2 group tests} The same conclusion is reached using a non-parametric (Wilcoxon rank sum) test. +\Rindex{wilcox.test()}% <>= -wilcox.test(cesd ~ sex, data=HELPrct) +wilcox.test(cesd ~ sex, data = HELPrct) @ @@ -82,29 +95,35 @@ undertake a two-sided test comparing the ages at baseline by gender.
First we calculate the observed difference in means. \Rindex{diffmean()}% \Rindex{shuffle()}% <<>>= -mean(age ~ sex, data=HELPrct) -test.stat <- diffmean(age ~ sex, data=HELPrct) +mean(age ~ sex, data = HELPrct) +test.stat <- diffmean(age ~ sex, data = HELPrct) test.stat @ We can calculate the same statistic after shuffling the group labels: <<>>= -do(1) * diffmean(age ~ shuffle(sex), data=HELPrct) -do(1) * diffmean(age ~ shuffle(sex), data=HELPrct) -do(3) * diffmean(age ~ shuffle(sex), data=HELPrct) +do(1) * diffmean(age ~ shuffle(sex), data = HELPrct) +do(1) * diffmean(age ~ shuffle(sex), data = HELPrct) +do(3) * diffmean(age ~ shuffle(sex), data = HELPrct) @ \DiggingDeeper{More details and examples can be found in the \pkg{mosaic} package Resampling Vignette.} \Rindex{xlim option}% -\Rindex{groups option}% +\Rindex{fill option}% +\Rindex{gf\_histogram()}% +\Rindex{gf\_vline()}% <>= rtest.stats <- do(500) * diffmean(age ~ shuffle(sex), - data=HELPrct) -favstats(~ diffmean, data=rtest.stats) -histogram(~ diffmean, n=40, xlim=c(-6, 6), - groups=diffmean >= test.stat, pch=16, cex=.8, - data=rtest.stats) -ladd(panel.abline(v=test.stat, lwd=3, col="red")) + data = HELPrct) +rtest.stats <- mutate(rtest.stats, + diffmeantest = + ifelse(diffmean >= test.stat, TRUE, FALSE)) +head(rtest.stats, 3) +favstats(~ diffmean, data = rtest.stats) +gf_histogram(~ diffmean, n = 40, xlim = c(-6, 6), + fill = ~ diffmeantest, pch = 16, cex = .8, + data = rtest.stats) %>% + gf_vline(xintercept = ~ test.stat, color = "red", lwd = 3) @ Here we don't see much evidence to contradict the null hypothesis that men and @@ -117,30 +136,31 @@ have the same mean age in the population. Earlier comparisons were between two groups. We can also consider testing differences between three or more groups using one-way ANOVA. Here we compare -CESD scores by primary substance of abuse (heroin, cocaine, or alcohol). +CESD scores by primary substance of abuse (heroin, cocaine, or alcohol) with a line rather than a dot to indicate the median.
-\Rindex{bwplot()}% +\Rindex{gf\_boxplot()}% \begin{center} <>= -bwplot(cesd ~ substance, data=HELPrct) +gf_boxplot(cesd ~ substance, data = HELPrct) @ \end{center} + <>= -mean(cesd ~ substance, data=HELPrct) +mean(cesd ~ substance, data = HELPrct) @ \Rindex{aov()}% <>= -anovamod <- aov(cesd ~ substance, data=HELPrct) +anovamod <- aov(cesd ~ substance, data = HELPrct) summary(anovamod) @ While still high (scores of 16 or more are generally considered to be ``severe'' symptoms), the cocaine-involved group tend to have lower scores than those whose primary substances are alcohol or heroin. <>= -modintercept <- lm(cesd ~ 1, data=HELPrct) -modsubstance <- lm(cesd ~ substance, data=HELPrct) +modintercept <- lm(cesd ~ 1, data = HELPrct) +modsubstance <- lm(cesd ~ substance, data = HELPrct) @ The \function{anova()} command can summarize models. @@ -149,7 +169,9 @@ The \function{anova()} command can summarize models. anova(modsubstance) @ -It can also be used to formally +In this setting the results are identical (since there is only one predictor, with 2 df). + +The \function{anova()} function can also be used to formally compare two (nested) models. \myindex{model comparison}% <>= @@ -168,7 +190,7 @@ Significant Differences (HSD). Other options are available within the \pkg{multcomp} package. 
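Base \R\ also provides \function{pairwise.t.test()} for all pairwise group comparisons with a multiplicity adjustment. A sketch (not from the original text; the Holm adjustment is an arbitrary choice here, and \dataframe{HELPrct} is assumed to be available, e.g., via the \pkg{mosaicData} package):

```r
library(mosaicData)  # assumed source of the HELPrct data

# All pairwise two-sample t tests among substance groups,
# with Holm-adjusted p-values
with(HELPrct, pairwise.t.test(cesd, substance,
                              p.adjust.method = "holm"))
```

The result is a matrix of adjusted p-values, one per pair of groups.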
<>= -favstats(cesd ~ substance, data=HELPrct) +favstats(cesd ~ substance, data = HELPrct) @ \Rindex{TukeyHSD()}% \Rindex{factor()}% @@ -178,9 +200,9 @@ favstats(cesd ~ substance, data=HELPrct) \Rindex{lm()}% <>= HELPrct <- mutate(HELPrct, subgrp = factor(substance, - levels=c("alcohol", "cocaine", "heroin"), - labels=c("A", "C", "H"))) -mod <- lm(cesd ~ subgrp, data=HELPrct) + levels = c("alcohol", "cocaine", "heroin"), + labels = c("A", "C", "H"))) +mod <- lm(cesd ~ subgrp, data = HELPrct) HELPHSD <- TukeyHSD(mod, "subgrp") HELPHSD @ diff --git a/StudentGuide/RBook.sty b/StudentGuide/RBook.sty new file mode 100644 index 0000000..018e3e9 --- /dev/null +++ b/StudentGuide/RBook.sty @@ -0,0 +1,225 @@ + +\ProvidesPackage{RBook}[2013/04/04 Mosaic R Books style] + +\makeatletter +\def\IfClass#1#2#3{\@ifundefined{opt@#1.cls}{#3}{#2}} +\makeatother +\IfClass{tufte-book}{\setcounter{secnumdepth}{3}}{\relax} +\IfClass{tufte-book}{\def\subsubsection#1{\newthought{#1}}}{} + +\RequirePackage{import} +\RequirePackage{graphicx} +\RequirePackage{alltt} +\RequirePackage{mparhack} +\RequirePackage{xstring} + +\RequirePackage{etoolbox} +\RequirePackage{multicol} +\RequirePackage{xcolor} +\RequirePackage{framed} +\RequirePackage{hyperref} +\RequirePackage{fancyhdr} + +\RequirePackage{probstat} +\RequirePackage[answerdelayed,exercisedelayed,lastexercise,chapter]{problems} +\RequirePackage{longtable} +\RequirePackage{language} + +\RequirePackage{tikz} +\usetikzlibrary{shadows} +\usetikzlibrary{decorations} +\usetikzlibrary{shapes.multipart} +\usetikzlibrary{shapes.symbols} +\usetikzlibrary{shapes.misc} +\usetikzlibrary{shapes.geometric} + +\RequirePackage[utf8]{inputenc} +\RequirePackage{underscore} + +\newcommand{\mymarginpar}[1]{% +\vadjust{\smash{\llap{\parbox[t]{\marginparwidth}{#1}\kern\marginparsep}}}} + +\newcommand{\tikzNote}[3]{% +\marginpar[% +\hspace*{0.5in} +\parbox{1.2in}{\begin{tikzpicture} +\node at (0,0) [#3] +{\parbox{1.05in}{\footnotesize {\sc #1 }{\raggedright #2}}}; 
+\end{tikzpicture}} +]{% +\parbox{1.2in}{\begin{tikzpicture} +\node at (0,0) [#3] +{\parbox{1.05in}{\footnotesize {\sc #1 }{\raggedright #2}}}; +\end{tikzpicture}}}\ignorespaces +}% + +\renewcommand{\tikzNote}[4][0pt]{ +\marginnote[#1]{ +\textsc{#2} + +\noindent #3 +}% +} + +%\newcommand{\InstructorNote}[3][]{% +%\tikzNote[#1]{#2}{#3}{double copy shadow={opacity=.5},tape,fill=blue!10,draw=blue,thick} +%} + +\newcommand{\InstructorNote}[2][0pt]{% + \tikzNote[#1]{}{#2}{tape,fill=blue!10,draw=blue,thick}% +} + +\newcommand{\DiggingDeeper}[2][0pt]{% +\tikzNote[#1]{\centerline{Digging Deeper}}{#2}{tape,fill=blue!10,draw=blue,thick}% +} + + +\newcommand{\TeachingTip}[2][0pt]{% +\tikzNote[#1]{\centerline{Teaching Tip}}{#2}{tape,fill=blue!10,draw=blue,thick}% +} + + +\newcommand{\FoodForThought}[2][0pt]{% +\tikzNote[#1]{}{#2}{rectangle,fill=green!10,draw=green,thick}% +} + +\newcommand{\SuggestionBox}[2][0pt]{% + \tikzNote[#1]{\centerline{Suggestion Box}}{#2}{rectangle,fill=green!10,draw=green,thick}% +} + +\newcommand{\Caution}[2][0pt]{% + \tikzNote[#1]{\centerline{Caution!}}{#2}{chamfered rectangle,fill=red!10,draw=red,thick}% +} + +\newcommand{\Pointer}[2][0pt]{% +\tikzNote[#1]{\centerline{More Info}}{#2}{}% +} + +\newcommand{\Note}[2][0pt]{% +\tikzNote[#1]{\centerline{Note}}{#2}{}% +} + +\newcommand{\BlankNote}[1][0pt]{% +\tikzNote[#1]{}{}{}% +} + + + +\newcounter{examplenum}[chapter] +\newenvironment{example}[1][\relax]{ +\refstepcounter{examplenum} +\textbf{Example \thechapter.\arabic{examplenum}.{#1}} +}{% +\hfill {\Large $\diamond$} +%\centerline{\rule{5in}{.5pt}} +} + + +\newcounter{myenumi} +\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} +\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} + +\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} + +%%%%%%%% tufte-book work-arounds %%%%%%%%% +\makeatletter +\newenvironment{widestuff}% +{ +\hspace*{-2ex}\begin{minipage}{\@tufte@fullwidth}}% +{\end{minipage}} +\makeatother + +%%%%%%%% 
Some R Stuff %%%%%%%%%%%%%%% +\newcommand{\rterm}[1]{\textbf{#1}} +\def\R{{\sf R}} +\def\Rstudio{{\sf RStudio}} +\def\RStudio{{\sf RStudio}} +\def\term#1{\textbf{#1}} +\def\tab#1{{\sf #1}} + +%%%%%%%%%%%%% some boxed elements %%%%%%%%%%%%%%%% + +\newlength{\tempfmlength} +\newsavebox{\fmbox} +\newenvironment{fmpage}[1] +{ +\medskip +\setlength{\tempfmlength}{#1} +\begin{lrbox}{\fmbox} +\begin{minipage}{#1} +\vspace*{.02\tempfmlength} +\hfill +\begin{minipage}{.95 \tempfmlength}} +{\end{minipage}\hfill +\vspace*{.015\tempfmlength} +\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} +\medskip +} + +\newenvironment{boxedText}[1][.98\textwidth]% +{% +\begin{center} +\begin{fmpage}{#1} +}% +{% +\end{fmpage} +\end{center} +} + +\newenvironment{boxedTable}[2][tbp]% +{% +\begin{table}[#1] + \refstepcounter{table} + \begin{center} +\begin{fmpage}{.98\textwidth} + \begin{center} + \sf \large Box~\expandafter\thetable. #2 +\end{center} +\medskip +}% +{% +\end{fmpage} +\end{center} +\end{table} % need to do something about exercises that follow boxedTable +} + +%%% indexing %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%\newcommand{\printindex}[1]{\relax} +%\newcommand{\indexchap}[1]{\relax} +%\RequirePackage{amsmidx} +\RequirePackage{makeidx} + +\newcommand{\exampleidx}[1]{{\it #1}} +\newcommand{\defidx}[1]{{\bf #1}} +\newcommand{\mainidx}[1]{{\bf #1}} +\newcommand{\probidx}[1]{{{\underline{#1}}}} +\makeindex +%\makeindex{Rindex} +%\makeindex{mainIndex} +%\newcommand{\Rindex}[1]{\index{Rindex}{#1@\texttt{#1}}} +%\newcommand{\myindex}[1]{\index{mainIndex}{#1}} +%\newcommand{\mathindex}[1]{\index{mainIndex}{$#1$}} +\newcommand{\Rindex}[1]{\index{#1@\texttt{#1}}} +\newcommand{\myindex}[1]{\index{#1}} +\newcommand{\mathindex}[1]{\index{$#1$}} + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%% probably don't need this --rjp +\makeatletter +\newcommand\gobblepar{% + \@ifnextchar\par% + {\expandafter\gobblepar\@gobble}% + {}} +\makeatother + +\pagestyle{fancy} + +%% 
allow more of page to be used for a figure or table +\renewcommand{\textfraction}{0.05} +\renewcommand{\topfraction}{0.8} +\renewcommand{\bottomfraction}{0.8} +\renewcommand{\floatpagefraction}{0.75} +\setcounter{topnumber}{4} +\setcounter{bottomnumber}{4} +\setcounter{totalnumber}{8} + diff --git a/StudentGuide/RBooks.sty b/StudentGuide/RBooks.sty new file mode 100644 index 0000000..9ccc11f --- /dev/null +++ b/StudentGuide/RBooks.sty @@ -0,0 +1,183 @@ +\documentclass[open-any,12pt]{tufte-book} +\setcounter{secnumdepth}{3} +\usepackage{import} +\usepackage{graphicx} + +\usepackage{alltt} +\usepackage{mparhack} +\usepackage{xstring} + +\usepackage{etoolbox} +\usepackage{multicol} +\usepackage{xcolor} +\usepackage{framed} +\usepackage{hyperref} +%%%%%%%%%%%% Things Danny omitted %%%%%%%%%%%%%%%%%%%%% +\usepackage{fancyhdr} + +%\newdimen\Rwidth +%\Rwidth=\textwidth + +%\usepackage[margin=.5in,outer=1.5in,inner=.9in,includehead,includefoot,paperwidth=7.25in,paperheight=9.5in]{geometry} +\usepackage{probstat} +\usepackage[shownotes]{authNote} +% \usepackage[hidenotes]{authNote} +\usepackage[answerdelayed,exercisedelayed,lastexercise,chapter]{problems} +\usepackage{longtable} +\usepackage{language} + +\usepackage{tikz} +\usetikzlibrary{shadows} +\usetikzlibrary{decorations} +\usetikzlibrary{shapes.multipart} +\usetikzlibrary{shapes.symbols} +\usetikzlibrary{shapes.misc} +\usetikzlibrary{shapes.geometric} + +\newcommand{\mymarginpar}[1]{% +\vadjust{\smash{\llap{\parbox[t]{\marginparwidth}{#1}\kern\marginparsep}}}} + +\newcommand{\tikzNote}[3]{% +\marginpar[% +\hspace*{0.5in} +\parbox{1.2in}{\begin{tikzpicture} +\node at (0,0) [#3] +{\parbox{1.05in}{\footnotesize {\sc #1 }{\raggedright #2}}}; +\end{tikzpicture}} +]{% +\parbox{1.2in}{\begin{tikzpicture} +\node at (0,0) [#3] +{\parbox{1.05in}{\footnotesize {\sc #1 }{\raggedright #2}}}; +\end{tikzpicture}}} +} + +\renewcommand{\tikzNote}[3]{ +\marginnote{ +\textsc{#1} + +#2 +} +} + +\newcommand{\InstructorNote}[2][\relax]{% 
+\tikzNote{#1}{#2}{double copy shadow={opacity=.5},tape,fill=blue!10,draw=blue,thick} +} + +\renewcommand{\InstructorNote}[2][\relax]{% +\tikzNote{#1}{#2}{tape,fill=blue!10,draw=blue,thick} +} + +\newcommand{\DiggingDeeper}[2][\centerline{Digging Deeper}]{% +\tikzNote{#1}{#2}{tape,fill=blue!10,draw=blue,thick} +} + + +\newcommand{\TeachingTip}[2][\centerline{Teaching Tip}]{% +\tikzNote{#1}{#2}{tape,fill=blue!10,draw=blue,thick} +} + + +\newcommand{\FoodForThought}[2][\relax]{% +\tikzNote{#1}{#2}{rectangle,fill=green!10,draw=green,thick} +} + +\newcommand{\SuggestionBox}[2][\centerline{Suggestion Box}]{% +\tikzNote{#1}{#2}{rectangle,fill=green!10,draw=green,thick} +} + +\newcommand{\Caution}[2][\centerline{Caution!}]{% +\tikzNote{#1}{#2}{chamfered rectangle,fill=red!10,draw=red,thick} +} + +\newcounter{examplenum}[chapter] +\newenvironment{example}[1][\relax]{ +\refstepcounter{examplenum} +\textbf{Example \thechapter.\arabic{examplenum}.{#1}} +}{% +\hfill {\Large $\diamond$} +%\centerline{\rule{5in}{.5pt}} +} + +\usepackage[utf8]{inputenc} + +\newcounter{myenumi} +\newcommand{\saveenumi}{\setcounter{myenumi}{\value{enumi}}} +\newcommand{\reuseenumi}{\setcounter{enumi}{\value{myenumi}}} + +\newcommand{\cran}{\href{http://www.R-project.org/}{CRAN}} +%%%%%%%% Some R Stuff %%%%%%%%%%%%%%% +\newcommand{\rterm}[1]{\textbf{#1}} +\def\R{{\sf R}} +\def\Rstudio{{\sf RStudio}} +\def\RStudio{{\sf RStudio}} +\def\term#1{\textbf{#1}} +\def\tab#1{{\sf #1}} + +%%%%%%%%%%%%% some boxed elements %%%%%%%%%%%%%%%% + +\newlength{\tempfmlength} +\newsavebox{\fmbox} +\newenvironment{fmpage}[1] +{ +\medskip +\setlength{\tempfmlength}{#1} +\begin{lrbox}{\fmbox} +\begin{minipage}{#1} +\vspace*{.02\tempfmlength} +\hfill +\begin{minipage}{.95 \tempfmlength}} +{\end{minipage}\hfill +\vspace*{.015\tempfmlength} +\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}} +\medskip +} + +\newenvironment{boxedText}[1][.98\textwidth]% +{% +\begin{center} +\begin{fmpage}{#1} +}% +{% +\end{fmpage} +\end{center} 
+} + +\newenvironment{boxedTable}[2][tbp]% +{% +\begin{table}[#1] + \refstepcounter{table} + \begin{center} +\begin{fmpage}{.98\textwidth} + \begin{center} + \sf \large Box~\expandafter\thetable. #2 +\end{center} +\medskip +}% +{% +\end{fmpage} +\end{center} +\end{table} % need to do something about exercises that follow boxedTable +} + +%%% indexing %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%% +\newcommand{\printindex}[1]{\relax} +\newcommand{\indexchap}[1]{\relax} +\usepackage{amsmidx} +\newcommand{\exampleidx}[1]{{\it #1}} +\newcommand{\defidx}[1]{{\bf #1}} +\newcommand{\mainidx}[1]{{\bf #1}} +\newcommand{\probidx}[1]{{{\underline{#1}}}} +\makeindex{Rindex} +\makeindex{mainIndex} +\newcommand{\Rindex}[1]{\index{Rindex}{#1@\texttt{#1}}} +\newcommand{\myindex}[1]{\index{mainIndex}{#1}} +\newcommand{\mathindex}[1]{\index{mainIndex}{$#1$}} + + +\pagestyle{fancy} + + +\begin{document} + +\end{document} diff --git a/Compendium/Compendium-Printed-Form.pdf b/StudentGuide/StudentGuide-Printed-Form.pdf similarity index 100% rename from Compendium/Compendium-Printed-Form.pdf rename to StudentGuide/StudentGuide-Printed-Form.pdf diff --git a/Compendium/Compendium-Printed-Form.tex b/StudentGuide/StudentGuide-Printed-Form.tex similarity index 100% rename from Compendium/Compendium-Printed-Form.tex rename to StudentGuide/StudentGuide-Printed-Form.tex diff --git a/StudentGuide/Studentguide2015-10-25.pdf b/StudentGuide/Studentguide2015-10-25.pdf new file mode 100644 index 0000000..0153594 Binary files /dev/null and b/StudentGuide/Studentguide2015-10-25.pdf differ diff --git a/StudentGuide/Studentguide2015-11-09.pdf b/StudentGuide/Studentguide2015-11-09.pdf new file mode 100644 index 0000000..28e3d27 Binary files /dev/null and b/StudentGuide/Studentguide2015-11-09.pdf differ diff --git a/StudentGuide/Studentguide2015-11-15.pdf b/StudentGuide/Studentguide2015-11-15.pdf new file mode 100644 index 0000000..ac69273 Binary files /dev/null and 
b/StudentGuide/Studentguide2015-11-15.pdf differ diff --git a/Compendium/SurvivalTime.Rnw b/StudentGuide/SurvivalTime.Rnw similarity index 65% rename from Compendium/SurvivalTime.Rnw rename to StudentGuide/SurvivalTime.Rnw index 1dfe818..a4e7c44 100644 --- a/Compendium/SurvivalTime.Rnw +++ b/StudentGuide/SurvivalTime.Rnw @@ -9,18 +9,21 @@ Extensive support for survival (time to event) analysis is available within the \myindex{Kaplan-Meier plot}% \Rindex{survfit()}% \Rindex{Surv()}% -\Rindex{conf.int option}% +\Rindex{gf\_step()}% \Rindex{xlab option}% +\Rindex{title option}% +\Rindex{ylab option}% +\Rindex{linetype option}% \begin{center} <>= -require(survival) +library(survival) +library(broom) fit <- survfit(Surv(dayslink, linkstatus) ~ treat, - data=HELPrct) -plot(fit, conf.int=FALSE, lty=1:2, lwd=2, - xlab="time (in days)", ylab="P(not linked)") -legend(20, 0.4, legend=c("Control", "Treatment"), - lty=c(1,2), lwd=2) -title("Product-Limit Survival Estimates (time to linkage)") + data = HELPrct) +fit <- broom::tidy(fit) +gf_step(fit, estimate ~ time, linetype = ~ strata, + title = "Product-Limit Survival Estimates (time to linkage)", + xlab = "time (in days)", ylab = "P(not linked)") @ \end{center} @@ -33,10 +36,10 @@ link to primary care (less likely to ``survive'') than the control (usual care) \Rindex{coxph()}% <>= -require(survival) +library(survival) summary(coxph(Surv(dayslink, linkstatus) ~ age + substance, - data=HELPrct)) + data = HELPrct)) @ -Neither age or substance group was significantly associated with linkage to primary care. +Neither age nor substance group was significantly associated with linkage to primary care. 
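Hazard ratios (the exponentiated Cox model coefficients) are often easier to interpret than the coefficients themselves. A sketch along the lines of the model above (not from the original text; \dataframe{HELPrct} is assumed to be available, e.g., via the \pkg{mosaicData} package):

```r
library(survival)
library(mosaicData)  # assumed source of the HELPrct data

# Refit the proportional hazards model and report hazard ratios
fit <- coxph(Surv(dayslink, linkstatus) ~ age + substance,
             data = HELPrct)
exp(coef(fit))     # hazard ratios
exp(confint(fit))  # 95% confidence intervals for the hazard ratios
```

A hazard ratio near 1 (with a confidence interval covering 1) is consistent with the conclusion that the predictor is not associated with time to linkage.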
diff --git a/Compendium/TwoCategorical.Rnw b/StudentGuide/TwoCategorical.Rnw similarity index 58% rename from Compendium/TwoCategorical.Rnw rename to StudentGuide/TwoCategorical.Rnw index d35526a..a1cdeee 100644 --- a/Compendium/TwoCategorical.Rnw +++ b/StudentGuide/TwoCategorical.Rnw @@ -12,55 +12,87 @@ for homeless status (homeless one or more nights in the past 6 months or housed) and sex. <>= -tally(~ homeless + sex, margins=FALSE, data=HELPrct) +tally(~ homeless + sex, margins = FALSE, data = HELPrct) @ We can also calculate column percentages: \Rindex{tally()}% <>= -tally(~ sex | homeless, margins=TRUE, format="percent", - data=HELPrct) +tally(~ sex | homeless, margins = TRUE, format = "percent", + data = HELPrct) @ We can calculate the odds ratio directly from the table: <>= -OR <- (40/169)/(67/177); OR +OR <- (40/169)/(67/177) +OR @ The \pkg{mosaic} package has a function which will calculate odds ratios: \Rindex{oddsRatio()}% <>= -oddsRatio(tally(~ (homeless=="housed") + sex, margins=FALSE, - data=HELPrct)) +oddsRatio(tally(~ (homeless == "housed") + sex, margins = FALSE, + data = HELPrct)) +@ + +The \function{CrossTable()} function in the \pkg{gmodels} package also displays +a cross classification table. + +\Rindex{CrossTable()}% +<>= +library(gmodels) +with(HELPrct, CrossTable(homeless, sex, + prop.r = FALSE, prop.chisq = FALSE, prop.t = FALSE)) @ Graphical summaries of cross classification tables may be helpful in visualizing associations. Mosaic plots are one example, where the total area (all observations) is proportional to one. \Caution{The jury is still out -regarding the utility of mosaic plots, relative to the low data to ink ratio\cite{Tufte:2001:Visual}. 
But we have found them to be helpful to reinforce understanding of a two way contingency table.}% +regarding the utility of mosaic plots (also known as eikosograms), +due to their low data to ink ratio.\cite{Tufte:2001:Visual}% +We have found them to be helpful to reinforce understanding of a two way contingency table.}% Here we see that males tend to be over-represented amongst the homeless subjects (as represented by the horizontal line which is higher for the homeless rather than the housed). \FoodForThought{The \function{mosaic()} function -in the \pkg{vcd} package also makes mosaic plots.} -\Rindex{mosaicplot()}% +in the \pkg{vcd} package makes mosaic plots.} +\Rindex{mosaic()}% +\Rindex{vcd package}% \begin{center} <>= -mytab <- tally(~ homeless + sex, margins=FALSE, - data=HELPrct) -mosaicplot(mytab) +mytab <- tally(~ homeless + sex, margins = FALSE, + data = HELPrct) +vcd::mosaic(mytab) +@ +<>= +vcd::mosaic(~ homeless + substance, data = HELPrct, + shade = TRUE) #example with color @ \end{center} -\newpage +\section{Creating tables from summary statistics} + +Tables can be created from summary statistics using the \function{do} function. 
+ +\Rindex{do()}% +\Rindex{rbind()}% +<<>>= +HELPtable <- rbind( + do(40) * data.frame(sex = "female", homeless = "homeless"), + do(169) * data.frame(sex = "male", homeless = "homeless"), + do(67) * data.frame(sex = "female", homeless = "housed"), + do(177) * data.frame(sex = "male", homeless = "housed") +) +tally(~ homeless + sex, data = HELPtable) +@ \section{Chi-squared tests} \Rindex{chisq.test()}% <>= -chisq.test(tally(~ homeless + sex, margins=FALSE, - data=HELPrct), correct=FALSE) +chisq.test(tally(~ homeless + sex, margins = FALSE, + data = HELPrct), correct = FALSE) @ There is a statistically significant association found: it is unlikely that we would observe @@ -74,8 +106,8 @@ The \function{xchisq.test()} function provides additional details (observed, exp \Rindex{xchisq.test()}% <>= -xchisq.test(tally(~homeless + sex, margins=FALSE, - data=HELPrct), correct=FALSE) +xchisq.test(tally(~ homeless + sex, margins = FALSE, + data = HELPrct), correct = FALSE) @ We observe that there are fewer homeless women, and more homeless men that would be expected. @@ -92,6 +124,6 @@ The \function{fisher.test()} function uses a different estimator (and different on the profile likelihood).} \Rindex{fisher.test()}% <>= -fisher.test(tally(~homeless + sex, margins=FALSE, - data=HELPrct)) +fisher.test(tally(~ homeless + sex, margins = FALSE, + data = HELPrct)) @ diff --git a/Compendium/TwoQuantitative.Rnw b/StudentGuide/TwoQuantitative.Rnw similarity index 67% rename from Compendium/TwoQuantitative.Rnw rename to StudentGuide/TwoQuantitative.Rnw index 8d4029f..6442505 100644 --- a/Compendium/TwoQuantitative.Rnw +++ b/StudentGuide/TwoQuantitative.Rnw @@ -12,16 +12,15 @@ with a lowess (locally weighted scatterplot smoother) line, using a circle as the plotting character and slightly thicker line. \InstructorNote{The lowess line can help to assess linearity of a relationship. 
This is added by specifying both points (using `p') and a lowess smoother.} -\Rindex{xyplot()}% -\Rindex{pch option}% -\Rindex{cex option}% -\Rindex{lwd option}% -\Rindex{type option}% +\Rindex{gf\_point()}% +\Rindex{shape option}% +\Rindex{size option}% +\Rindex{gf\_smooth()}% \begin{center} -<>= -females <- filter(HELPrct, female==1) -xyplot(cesd ~ mcs, type=c("p","smooth"), pch=1, cex=0.6, - lwd=3, data=females) +<>= +Female <- filter(HELPrct, female == 1) +gf_point(cesd ~ mcs, data = Female, shape = 1) %>% + gf_smooth(se = FALSE, size = 2) @ \end{center} \DiggingDeeper{The \emph{Start Modeling with R} companion book will be helpful if you are unfamiliar with the @@ -29,18 +28,15 @@ modeling language. The \emph{Start Teaching with R} also provides useful guidance.} It's straightforward to plot something besides a character in a scatterplot. In this example, the \dataframe{USArrests} can be used to plot the association -between murder and assault rates, with the state name displayed. This -requires a panel function to be written. -\Rindex{function()}% -\Rindex{panel.labels()}% -\Rindex{panel.text()}% +between murder and assault rates, with the state name displayed. \Rindex{rownames()}% +\Rindex{gf\_text()}% +\Rindex{label option}% +\Rindex{size option}% <>= -panel.labels <- function(x, y, labels='x',...) { - panel.text(x, y, labels, cex=0.4, ...) -} -xyplot(Murder ~ Assault, panel=panel.labels, - labels=rownames(USArrests), data=USArrests) +gf_text(Murder ~ Assault, + label = ~ rownames(USArrests), + data = USArrests) @ @@ -48,17 +44,13 @@ xyplot(Murder ~ Assault, panel=panel.labels, \section{Correlation} -<>= -#detach(package:MASS) # where did MASS come from? (histogram in mosaic XX) -@ - Correlations can be calculated for a pair of variables, or for a matrix of variables.
\myindex{correlation}% \Rindex{cor()}% <<>>= -cor(cesd, mcs, data=females) -smallHELP <- select(females, cesd, mcs, pcs) +cor(cesd ~ mcs, data = Female) +smallHELP <- select(Female, cesd, mcs, pcs) cor(smallHELP) @ \myindex{Pearson correlation}% @@ -67,7 +59,7 @@ cor(smallHELP) By default, Pearson correlations are provided. Other variants (e.g., Spearman) can be specified using the \option{method} option. <<>>= -cor(cesd, mcs, method="spearman", data=females) +cor(cesd ~ mcs, method = "spearman", data = Female) @ \section{Pairs plots} @@ -75,10 +67,11 @@ cor(cesd, mcs, method="spearman", data=females) \myindex{scatterplot matrix}% A pairs plot (scatterplot matrix) can be calculated for each pair of a set of variables. -\TeachingTip{The \pkg{GGally} package has support for more elaborate pairs plots.} -\Rindex{splom()}% -<>= -splom(smallHELP) +\FoodForThought{The \pkg{GGally} package has support for more elaborate pairs plots.} +\Rindex{ggpairs()}% +<>= +library(GGally) +ggpairs(smallHELP) @ \section{Simple linear regression} @@ -98,7 +91,7 @@ and predictors. Here we consider fitting the model \model{\variable{cesd}}{\var \Rindex{lm()}% \Rindex{coef()}% <<>>= -cesdmodel <- lm(cesd ~ mcs, data=females) +cesdmodel <- lm(cesd ~ mcs, data = Female) coef(cesdmodel) @ \InstructorNote{It's important to pick good names for modeling @@ -108,14 +101,14 @@ scores.} To simplify the output, we turn off the option to display significance stars. 
\myindex{significance stars}% -\Rindex{summary()}% +\Rindex{msummary()}% \Rindex{confint()}% \Rindex{rsquared()}% \Rindex{coef()}% <<>>= -options(show.signif.stars=FALSE) +options(show.signif.stars = FALSE) coef(cesdmodel) -summary(cesdmodel) +msummary(cesdmodel) coef(summary(cesdmodel)) confint(cesdmodel) rsquared(cesdmodel) @@ -140,67 +133,70 @@ returns a vector of predicted values.} \Rindex{density option}% \begin{center} <>= -histogram(~ residuals(cesdmodel), density=TRUE) +gf_histogram(~ residuals(cesdmodel), density = TRUE) @ \end{center} -\Rindex{qqmath()}% +\Rindex{gf\_qq()}% \begin{center} <>= -qqmath(~ resid(cesdmodel)) +gf_qq(~ resid(cesdmodel)) @ \end{center} \Rindex{alpha option}% +\Rindex{gf\_hline()}% \begin{center} -<>= -xyplot(resid(cesdmodel) ~ fitted(cesdmodel), type=c("p", "smooth", "r"), - alpha=0.5, cex=0.3, pch=20) +<>= +gf_point(resid(cesdmodel) ~ fitted(cesdmodel), + alpha = 0.5, cex = 0.3, pch = 20) %>% + gf_smooth(se = FALSE) %>% + gf_hline(yintercept = 0) @ \end{center} -The \function{mplot()} function can facilitate creating a variety of useful plots, including the same residuals vs. fitted scatterplots, by specifying the \option{which=1} option. +The \function{mplot()} function can facilitate creating a variety of useful plots, including the same residuals vs. fitted scatterplots, by specifying the \option{which = 1} option. 
\Rindex{mplot()}% \Rindex{which option}% -<>= -mplot(cesdmodel, which=1) +<>= +mplot(cesdmodel, which = 1) @ -It can also generate a normal quantile-quantile plot (\option{which=2}), -<>= -mplot(cesdmodel, which=2) +It can also generate a normal quantile-quantile plot (\option{which = 2}), +<>= +mplot(cesdmodel, which = 2) @ \myindex{scale versus location}% scale vs.\,location, -<>= -mplot(cesdmodel, which=3) +<>= +mplot(cesdmodel, which = 3) @ \myindex{Cook's distance}% Cook's distance by observation number, -<>= -mplot(cesdmodel, which=4) +<>= +mplot(cesdmodel, which = 4) @ -\newpage - \myindex{leverage}% -residuals vs.\,leverage -<>= -mplot(cesdmodel, which=5) +residuals vs.\,leverage, +<>= +mplot(cesdmodel, which = 5) @ -Cook's distance vs. leverage. -<>= -mplot(cesdmodel, which=6) +and Cook's distance vs. leverage. +<>= +mplot(cesdmodel, which = 6) @ -\myindex{prediction bands}% -\Rindex{panel.lmbands()}% -\Rindex{band.lwd option}% -Prediction bands can be added to a plot using the \function{panel.lmbands()} function. +\myindex{prediction intervals}% +\Rindex{gf\_lm()}% +\Rindex{interval option}% +Prediction intervals can be added to a plot using the \option{interval} option in \function{gf\_lm()}. 
\begin{center} -<<>>= -xyplot(cesd ~ mcs, panel=panel.lmbands, cex=0.2, band.lwd=2, data=HELPrct) +<>= +gf_point(cesd ~ mcs, data = HELPrct) %>% + gf_lm(interval = "confidence", fill = "red") %>% + gf_lm(interval = "prediction", fill = "navy") @ \end{center} diff --git a/StudentGuide/authNote.sty b/StudentGuide/authNote.sty new file mode 100644 index 0000000..e332349 --- /dev/null +++ b/StudentGuide/authNote.sty @@ -0,0 +1,209 @@ + +\NeedsTeXFormat{LaTeX2e}[1995/12/01] +\ProvidesPackage{authNote}[2005/06/14 1.0 (RJP)] + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% my package requirements +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\RequirePackage{ifthen} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% options and booleans for them +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%\reversemarginpar % ?? + +%\smartqed % makes \qed print at the rightmargin + +\newboolean{shownotes} +\setboolean{shownotes}{true} +\DeclareOption{hidenotes}{\setboolean{shownotes}{false}} +\DeclareOption{shownotes}{\setboolean{shownotes}{true}} +\DeclareOption{hide}{\setboolean{shownotes}{false}} +\DeclareOption{show}{\setboolean{shownotes}{true}} + +\newboolean{showhmm} +\setboolean{showhmm}{true} +\DeclareOption{hidehmm}{\setboolean{showhmm}{false}} +\DeclareOption{showhmm}{\setboolean{showhmm}{true}} + +\newboolean{showopt} +\setboolean{showopt}{true} +\DeclareOption{hideopt}{\setboolean{showopt}{false}} +\DeclareOption{showopt}{\setboolean{showopt}{true}} + +\newboolean{showold} +\setboolean{showold}{false} +\DeclareOption{showold}{\setboolean{showold}{true}} +\DeclareOption{hideold}{\setboolean{showold}{false}} + +\DeclareOption{primary}{% + \setboolean{showhmm}{true} + \setboolean{showopt}{true} + \setboolean{shownotes}{true} + \setboolean{showold}{false} + } + +\DeclareOption{secondary}{% + \setboolean{showhmm}{false} + 
\setboolean{showopt}{false} + \setboolean{shownotes}{true} + \setboolean{showold}{false} + } + +\DeclareOption{clean}{% + \setboolean{showhmm}{false} + \setboolean{showopt}{false} + \setboolean{shownotes}{false} + \setboolean{showold}{false} + } + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\ProcessOptions* + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%Translation Helps +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%\def\transbox#1{{\textbf{#1}}} +\def\transbox#1{{\textbf{#1}}} + +\def\hmm[#1]#2{\ifthenelse% + {\boolean{showhmm}}% + {\transbox{#2}\smallmarginpar{#1}{}}% + {#2}% +} + +\def\hmmok[#1]#2{% + \ifthenelse{\boolean{showold}}% + {\transbox{#2}\smallmarginpar{#1}}% + {#2}% +} + +\def\hmmOK[#1]#2{% + \ifthenelse{\boolean{showold}}% + {\transbox{#2}\smallmarginpar{#1}}% + {#2}% +} + +\newcommand{\options}[2]{% + \ifthenelse{\boolean{showopt}}% +%{\textbf{$\mathbf<$#1 $\mathbf\mid$ #2$\mathbf >$}\smallmarginpar{$<\mid>$}}% + {% + \smallmarginpar{$<$#1$\mid$#2$>$}% + #1% + }% + {#1}% +} + + +\newcommand{\optionsok}[2]{% + \ifthenelse{\boolean{showold}}% +% {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {% + \smallmarginpar{$<$#1$\mid$#2$>$}% + #1% + }% + {#1}% + } + +\newcommand{\optionsOk}[2]{% + \ifthenelse{\boolean{showold}}% +% {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {% + \smallmarginpar{$<$#1$\mid$#2$>$}% + #1% + }% + {#1}% + } + +\newcommand{\optionsOne}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#1}% + } + +\newcommand{\optionsone}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#1}% + } + +\newcommand{\optionsTwo}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#2$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#2}% + } + +\newcommand{\optionstwo}[2]{% + 
\ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#2$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#2}% + } + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% authNote stuff +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\newtoks\tempTok +\newcounter{noteNum}[section] +\newwrite\noteFile% +\immediate\openout\noteFile=\jobname.notes + +\long\def\saveNote#1{% +\refstepcounter{noteNum}% +\immediate\write\noteFile{\string\begingroup\string\bf }% +\immediate\write\noteFile{\thesection .% +\expandafter\arabic{noteNum}}% +\immediate\write\noteFile{(p. \expandafter\thepage): }% +\immediate\write\noteFile{\string\endgroup}% +\tempTok={#1} +\immediate\write\noteFile{\the\tempTok}% +\immediate\write\noteFile{}% +} + + +\def\smallmarginpar#1{\marginpar[\hfill \tiny #1]{\raggedright \tiny #1 \hfill}} + +\long\def\saveNshowNote#1#2{% + \saveNote{#2}% + \ifthenelse{\boolean{shownotes}}{% + \marginpar[\hfill {\tiny #1 + \thesection.\arabic{noteNum} $\rightarrow$}]% + {{\tiny $\leftarrow$ \thesection.\arabic{noteNum} #1 \hfill}}% + }{\relax}% +} + +% to remove marginal notes (for submissions, etc) use below instead: + +\long\def\authNote#1{\saveNshowNote{}{#1}} +\long\def\oldNote#1{\saveNshowNote{old}{old: #1}} +\long\def\authNoted#1{% +\ifthenelse{\boolean{showold}}% +{\saveNshowNote{$\surd$}{(Done) #1}}% +{\relax}% +} + +\long\def\authNotedOld#1{\relax} + + +\def\authNotes{% +\ifthenelse{\boolean{shownotes}}{% +%\section*{Author Notes} +\begingroup +\immediate\closeout\noteFile +\parindent=0pt +\input \jobname.notes +\endgroup +} +{\relax} +} + + diff --git a/StudentGuide/language.sty b/StudentGuide/language.sty new file mode 100644 index 0000000..54006ce --- /dev/null +++ b/StudentGuide/language.sty @@ -0,0 +1,44 @@ +\ProvidesPackage{language} + +\RequirePackage{xstring} +\RequirePackage{xcolor} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% Looking for a consistent typography for language elements. 
+
+\providecommand{\R}{}
+\renewcommand{\R}{\mbox{\sf{R}}}
+\providecommand{\RStudio}{}
+\renewcommand{\RStudio}{\mbox{\sf{R}Studio}}
+\providecommand{\Sage}{}
+\renewcommand{\Sage}{\mbox{\sf{Sage}}}
+
+\providecommand{\variable}[1]{}
+\renewcommand{\variable}[1]{{\color{green!50!black}\texttt{#1}}}
+\providecommand{\dataframe}[1]{}
+\renewcommand{\dataframe}[1]{{\color{blue!80!black}\texttt{#1}}}
+\providecommand{\function}[1]{}
+\renewcommand{\function}[1]{{\color{purple!75!blue}\texttt{\StrSubstitute{#1}{()}{}()}}}
+\providecommand{\option}[1]{}
+\renewcommand{\option}[1]{{\color{brown!80!black}\texttt{#1}}}
+\providecommand{\pkg}[1]{}
+\renewcommand{\pkg}[1]{{\color{red!80!black}\texttt{#1}}}
+\providecommand{\code}[1]{}
+\renewcommand{\code}[1]{{\color{blue!80!black}\texttt{#1}}}
+
+\providecommand{\file}[1]{}
+\renewcommand{\file}[1]{{\tt #1}}
+
+% This looks really hokey. Probably need to redefine this.
+\providecommand{\model}[2]{}
+\renewcommand{\model}[2]{{$\,$\hbox{#1}\ \ensuremath{\sim}\ \hbox{#2}}}
+
+% These should be considered deprecated -- cease and desist
+\providecommand{\VN}[1]{}
+\renewcommand{\VN}[1]{{\color{green!50!black}\texttt{#1}}}
+\providecommand{\vn}[1]{}
+\renewcommand{\vn}[1]{{\color{green!50!black}\texttt{#1}}}
+\providecommand{\DFN}[1]{}
+\renewcommand{\DFN}[1]{{\color{blue!80!black}\texttt{#1}}}
+\providecommand{\dfn}[1]{}
+\renewcommand{\dfn}[1]{{\color{blue!80!black}\texttt{#1}}}
+
diff --git a/StudentGuide/markdown1.png b/StudentGuide/markdown1.png
new file mode 100644
index 0000000..f6d20b2
Binary files /dev/null and b/StudentGuide/markdown1.png differ
diff --git a/StudentGuide/markdown2.png b/StudentGuide/markdown2.png
new file mode 100644
index 0000000..ea10ab2
Binary files /dev/null and b/StudentGuide/markdown2.png differ
diff --git a/StudentGuide/markdown3.png b/StudentGuide/markdown3.png
new file mode 100644
index 0000000..a7460ed
Binary files /dev/null and b/StudentGuide/markdown3.png differ
diff --git
a/StudentGuide/markdown4.png b/StudentGuide/markdown4.png new file mode 100644 index 0000000..02b6816 Binary files /dev/null and b/StudentGuide/markdown4.png differ diff --git a/StudentGuide/problems.sty b/StudentGuide/problems.sty new file mode 100644 index 0000000..a8eef68 --- /dev/null +++ b/StudentGuide/problems.sty @@ -0,0 +1,258 @@ +\NeedsTeXFormat{LaTeX2e}[1999/12/01] +\ProvidesPackage{amsprobs} + [2007/12/11 v0.1 problems package (R. Pruim (based on P.Pichaureau))] +%% \CharacterTable +%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z +%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z +%% Digits \0\1\2\3\4\5\6\7\8\9 +%% Exclamation \! Double quote \" Hash (number) \# +%% Dollar \$ Percent \% Ampersand \& +%% Acute accent \' Left paren \( Right paren \) +%% Asterisk \* Plus \+ Comma \, +%% Minus \- Point \. Solidus \/ +%% Colon \: Semicolon \; Less than \< +%% Equals \= Greater than \> Question mark \? +%% Commercial at \@ Left bracket \[ Backslash \\ +%% Right bracket \] Circumflex \^ Underscore \_ +%% Grave accent \` Left brace \{ Vertical bar \| +%% Right brace \} Tilde \~} +%% +\newif\if@AnswerOutput \@AnswerOutputtrue +\newif\if@AnswerDelay \@AnswerDelayfalse +\newif\if@ExerciseOutput \@ExerciseOutputtrue +\newif\if@ExerciseDelay \@ExerciseDelayfalse +\newif\if@AswLastExe \@AswLastExefalse +\newif\if@ShowLabel \@ShowLabelfalse +\newif\if@NumberInChapters \@NumberInChaptersfalse +\newif\if@NumberInSections \@NumberInSectionsfalse +\newif\if@DottedProbNumbers \@DottedProbNumbersfalse + +\DeclareOption{dotted} {\@DottedProbNumberstrue} +\DeclareOption{dotless} {\@DottedProbNumbersfalse} +\DeclareOption{noanswer} {\@AnswerOutputfalse} +\DeclareOption{answeronly} {\@ExerciseOutputfalse} +\DeclareOption{noexercise} {\@ExerciseOutputfalse} +\DeclareOption{exerciseonly} {\@AnswerOutputfalse} +\DeclareOption{outputnothing}{\@ExerciseOutputfalse\@AnswerOutputfalse} +\DeclareOption{exercisedelayed}{\@ExerciseDelaytrue} 
+\DeclareOption{answerdelayed}{\@AnswerDelaytrue} +\DeclareOption{lastexercise} {\@AswLastExetrue} +\DeclareOption{showlabel} {\@ShowLabeltrue} +\DeclareOption{chapter} {\@NumberInChapterstrue} +\DeclareOption{section} {\@NumberInSectionstrue} + +\ProcessOptions +\RequirePackage{keyval, ifthen} +\RequirePackage{xspace} + +\newbox\problemset@bin +\newbox\problem@bin +\newbox\solution@bin +\newbox\solutionset@bin +\newbox\studentsolution@bin +\newbox\studentsolutionset@bin +\global\setbox\problem@bin=\vbox{} +\global\setbox\problemset@bin=\vbox{} +\global\setbox\solution@bin=\vbox{} +\global\setbox\solutionset@bin=\vbox{} +\global\setbox\studentsolution@bin=\vbox{} +\global\setbox\studentsolutionset@bin=\vbox{} + +\def\renewcounter#1{% + \@ifundefined{c@#1} + {\@latex@error{counter #1 undefined}\@ehc}% + \relax + \let\@ifdefinable\@rc@ifdefinable + \@ifnextchar[{\@newctr{#1}}{}} + +\newcounter{problemNum} +\renewcommand{\theproblemNum}{\arabic{problemNum}} + +\if@NumberInSections + \renewcounter{problemNum}[section] + \renewcommand{\theproblemNum}{\thesection.\arabic{problemNum}}% +\fi + +\if@NumberInChapters + \renewcounter{problemNum}[chapter]% + \renewcommand{\theproblemNum}{\thechapter.\arabic{problemNum}}% +\fi + +\def\Rausskip{\ \vspace{-1\baselineskip}} +\def\Rausskip{\ \vspace{-.5\baselineskip}} + +\newenvironment{problem}% +{% +\refstepcounter{problemNum}% +%\begingroup% +\renewcommand{\labelenumi}{\textbf{\alph{enumi})}}% +\renewcommand{\labelenumii}{\roman{enumii}.}% +\renewcommand{\labelenumiii}{\Alph{enumiii}.}% +\renewcommand{\theenumi}{{\alph{enumi}}}% +\renewcommand{\theenumii}{\roman{enumii}}% +\renewcommand{\theenumiii}{\Alph{enumiii}}% +\global\setbox\problem@bin=\vbox\bgroup% +\noindent\textbf{\thechapter.\arabic{problemNum}.}% +}{% +\egroup% +\global\setbox\problemset@bin=\vbox{% +\unvbox\problemset@bin% + +\bigskip + +\unvbox\problem@bin% +%\endgroup% +} +}% + + + + +\newboolean{StudentSolution} +\newboolean{InstructorSolution} 
+\setboolean{StudentSolution}{false} +\setboolean{InstructorSolution}{true} +\newenvironment{solution}[1][\@empty]% +{% +% Do this by default +\setboolean{StudentSolution}{false} +\setboolean{InstructorSolution}{true} + +% Modify based on #1 +\ifthenelse{\equal{#1}{both}}{ + \setboolean{StudentSolution}{true} + \setboolean{InstructorSolution}{true}}% + {\relax} + +\ifthenelse{\equal{#1}{student}}{ + \setboolean{StudentSolution}{true} + \setboolean{InstructorSolution}{false}}% + {\relax} + +\ifthenelse{\equal{#1}{instructor}}{ + \setboolean{StudentSolution}{false} + \setboolean{InstructorSolution}{true}}% + {\relax} + +\renewcommand{\labelenumi}{\textbf{\alph{enumi})}}% +\renewcommand{\labelenumii}{\roman{enumii}.}% +\renewcommand{\labelenumiii}{\Alph{enumiii}.}% +\renewcommand{\theenumi}{{\alph{enumi}}}% +\renewcommand{\theenumii}{\roman{enumii}}% +\renewcommand{\theenumiii}{\Alph{enumiii}}% +\renewcommand{\labelenumii}{\textbf{\alph{enumii})}}% +\renewcommand{\labelenumiii}{\roman{enumiii}.}% +\renewcommand{\labelenumiv}{\Alph{enumiv}.}% +\renewcommand{\theenumii}{{\alph{enumii}}}% +\renewcommand{\theenumiii}{\roman{enumiii}}% +\renewcommand{\theenumiv}{\Alph{enumiv}}% +\global\setbox\solution@bin=\vbox\bgroup% +%\noindent\textbf{Solution \thechapter.\arabic{problemNum}. }% +%\begin{enumerate} +%\item[\textbf{\thechapter.\arabic{problemNum}.}]% +\noindent \textbf{\thechapter.\arabic{problemNum}. 
}% +}{% +%\end{enumerate} +\egroup% +% +% save to instructor solution set (if we should) +% +\ifthenelse{\boolean{InstructorSolution}}{% +\global\setbox\solutionset@bin=\vbox{% +\unvbox\solutionset@bin% + +\bigskip + +\unvcopy\solution@bin% +}}{\relax}% +% +% save to student solution set (if we should) +% +\ifthenelse{\boolean{StudentSolution}}{% +\global\setbox\studentsolutionset@bin=\vbox{% +\unvbox\studentsolutionset@bin% + +\medskip + +\unvbox\solution@bin% +} +}{\relax} +}% + +\newenvironment{studentsolution}[1][\@empty]% +{% +%\begingroup% +\def\paramOne{#1} +\renewcommand{\labelenumii}{\textbf{\alph{enumii})}}% +\renewcommand{\labelenumiii}{\roman{enumiii}.}% +\renewcommand{\labelenumiv}{\Alph{enumiv}.}% +\renewcommand{\theenumii}{{\alph{enumii}}}% +\renewcommand{\theenumiii}{\roman{enumiii}}% +\renewcommand{\theenumiv}{\Alph{enumiv}}% +\global\setbox\studentsolution@bin=\vbox\bgroup% +\begin{enumerate} +\item[\textbf{\thechapter.\arabic{problemNum}.}]% +}{% +\end{enumerate} +\egroup% +\global\setbox\studentsolutionset@bin=\vbox{% +\unvbox\studentsolutionset@bin% + +\medskip + +\unvbox\studentsolution@bin% +%\endgroup% +} +}% + +\newenvironment{bothsolution} +{% +\renewcommand{\labelenumii}{\textbf{\alph{enumii})}}% +\renewcommand{\labelenumiii}{\roman{enumiii}.}% +\renewcommand{\labelenumiv}{\Alph{enumiv}.}% +\renewcommand{\theenumii}{{\alph{enumii}}}% +\renewcommand{\theenumiii}{\roman{enumiii}}% +\renewcommand{\theenumiv}{\Alph{enumiv}}% +\global\setbox\solution@bin=\vbox\bgroup% +\begin{enumerate} +\item[\textbf{\thechapter.\arabic{problemNum}.}]% +}{% +\end{enumerate} +\egroup% +\global\setbox\solutionset@bin=\vbox{% +\unvbox\solutionset@bin% + +\bigskip + +\unvcopy\solution@bin% +} +\global\setbox\studentsolutionset@bin=\vbox{% +\unvbox\studentsolutionset@bin% + +\medskip + +\unvbox\solution@bin% +} +}% + +\def\shipoutProblems{% +%\begin{xcb}{Exercises} +\unvbox\problemset@bin +\unvbox\problem@bin +%\end{xcb} +} + +\def\shipoutSolutions{% 
+\unvbox\solutionset@bin +\newpage +} + +\def\shipoutStudentSolutions{% +\unvbox\studentsolutionset@bin +\newpage +} + +\endinput + + + diff --git a/StudentGuide/probstat.sty b/StudentGuide/probstat.sty new file mode 100644 index 0000000..9231ad5 --- /dev/null +++ b/StudentGuide/probstat.sty @@ -0,0 +1,350 @@ + +\ProvidesPackage{probstat} +\RequirePackage{amsmath} +\RequirePackage{ifthen} +\RequirePackage{amsmath} +\RequirePackage{bm} +\RequirePackage{xcolor} +\RequirePackage{fancyvrb} + +\newboolean{longExp} +\setboolean{longExp}{false} +\DeclareOption{longExp}{\setboolean{longExp}{true}} +\DeclareOption{shortExp}{\setboolean{longExp}{false}} + +\ProcessOptions* + +\newcommand{\term}[1]{\textbf{#1}} +\newcommand{\code}[1]{{\tt #1}} +\newcommand{\file}[2][R]{{\tt #2}} +\newcommand{\command}[1]{\texttt{#1}} +%\newcommand{\R}{\mbox{\texttt{R}}} +\newcommand{\R}{\mbox{\sf{R}}} + +\newlength{\cwidth} +\newcommand{\cents}{\settowidth{\cwidth}{c}% +\divide\cwidth by2 +\advance\cwidth by-.1pt +c\kern-\cwidth +\vrule width .1pt depth.2ex height1.2ex +\kern\cwidth} + +\def\myRuleColor{\color{blue!45!white}} +\colorlet{myRuleColor}{blue!45!white} +\def\includeR#1{% +\typeout{Including R output from #1} +\VerbatimInput[framerule=.5mm, + frame=leftline, + rulecolor=\myRuleColor, + fontsize=\small]{#1} +} +\def\includeRtiny#1{% +\typeout{Including R output from #1} +\VerbatimInput[framerule=.5mm, + frame=leftline, + rulecolor=\myRuleColor, + fontsize=\footnotesize]{#1} +} + + +\DefineVerbatimEnvironment% +{Rcode}{Verbatim} +{framerule=.5mm,frame=leftline,rulecolor=\myRuleColor,fontsize=\small} + +\DefineVerbatimEnvironment% +{tinyRcode}{Verbatim} +{framerule=.5mm,frame=leftline,rulecolor=\myRuleColor,fontsize=\tiny} + +\DefineVerbatimEnvironment% +{footRcode}{Verbatim} +{framerule=.5mm,frame=leftline,rulecolor=\myRuleColor,fontsize=\footnotesize} + +\def\includeRaus#1{% + +\hfill \makebox[0pt]{\fbox{\tiny #1}} +\vspace*{-3ex} + +%\xmarginpar{\fbox{\tiny #1}}% 
+\includeR{Rout/#1.Raus} +%\hfill +%\rule{1in}{.3pt} +%\rule{1in}{.3pt} +%\hfill +} +\def\includeRchunk#1{% + +\hfill \makebox[0pt]{\fbox{\tiny #1}} +\vspace*{-3ex} + +%\xmarginpar{\fbox{\tiny #1}}% +\includeR{Rchunk/#1.Rchunk} +%\hfill +%\rule{1in}{.3pt} +%\rule{1in}{.3pt} +%\hfill +} + +\def\includeRausTwo#1{% + +\hfill \makebox[0pt]{\fbox{\tiny #1}} +\vspace*{-3ex} + +\begin{multicols}{2} +\includeR{Rout/#1.Raus} +\end{multicols} +} + + +%% basic probability stuff +\newcommand{\E}{\operatorname{E}} +\newcommand{\Prob}{\operatorname{P}} +\def\evProb#1{\Prob(\mbox{#1})} +\newcommand{\Var}{\operatorname{Var}} +\newcommand{\coVar}{\operatorname{Cov}} +\newcommand{\Cov}{\operatorname{Cov}} +\newcommand{\covar}{\operatorname{Cov}} +\newcommand{\argmax}{\operatorname{argmax}} +\newcommand{\argmin}{\operatorname{argmin}} + +\newcommand\simiid{\stackrel{\tiny \operatorname{iid}}{\sim}} + +\newcommand{\distribution}[1]{{\textsf{#1}}} +\gdef\Bin{\distribution{Binom}} +\gdef\Binom{\distribution{Binom}} +\gdef\Multinom{\distribution{Multinom}} +\gdef\NBinom{\distribution{NBinom}} +\gdef\Geom{\distribution{Geom}} +\gdef\Norm{\distribution{Norm}} +\gdef\Hyper{\distribution{Hyper}} +\gdef\Unif{\distribution{Unif}} +\ifthenelse{\boolean{longExp}}{% + \gdef\Exponential{\distribution{Exp}}% + }{% + \gdef\Exp{\distribution{Exp}}% + } +\gdef\Poisson{\distribution{Pois}} +\gdef\Pois{\distribution{Pois}} +\gdef\Gam{\distribution{Gamma}} +\gdef\Gamm{\distribution{Gamma}} +\gdef\Beta{\distribution{Beta}} +\gdef\Weibull{\distribution{Weibull}} +\gdef\Chisq{\distribution{Chisq}} +\gdef\Tdist{\distribution{T}} +\gdef\Fdist{\distribution{F}} + + +\def\mean#1{\overline{#1}} + +\def\Prob{\operatorname{P}} +\ifthenelse{\boolean{longExp}}{% + \def\Exp{\operatorname{E}} + }{% + \def\E{\operatorname{E}} + } +\def\Var{\operatorname{Var}} +\def\SD{\operatorname{SD}} + + +%% ANOVA abbreviations +\def\SE{SE} +\def\SSe{SSE} +\def\SSTot{SSTot} +\def\SSr{SSM} +\def\SSM{SSM} +\def\SSx{SS_x} 
+\def\SSy{SS_y} +\def\Sxy{S_{xy}} +\def\Sxx{S_{xx}} +\def\Syy{S_{yy}} + + + +%% some colors +\colorlet{trCol}{green!50!black} +\colorlet{erCol}{red!70!black} +\colorlet{meanCol}{orange!90!black} +\colorlet{adjCol}{blue!80!black} +\colorlet{fitCol}{purple} + +%% some vector stuff + +\newcommand{\rowvec}[1]{{\left[ #1 \right]}} +\newcommand{\transpose}[1]{{#1}^T} +\newcommand{\colvec}[1]{\transpose{\rowvec{#1}}} +\def\vec#1{\bm{#1}} +\def\mat#1{\bm{#1}} +\newcommand{\vecarray}[2][black]{ +\textcolor{#1}{ +\left[ +\begin{array}{r} + #2 +\end{array} +\right] +} +} + +\newenvironment{brackmat}% +{% +\left[ +\begin{matrix} +}{% +\end{matrix} +\right] +} + +%\newcommand{\D}[2]{\frac{\partial}{\partial #2}#1} + + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% some hacks and kludges +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\def\lhd{\mathrel{\vartriangleleft}} +\def\unlhd{\mathrel{\trianglelefteq}} +\def\Box{\mathrel{\square}} +\def\QED{\hfill\mbox{$\Box$}} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% macros from schoening +% with modifications +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%% relations + +\def\wiggle{\sim} +\def\wiggles{\wiggle} +\def\approxwiggle{\stackrel{\cdot}{\wiggle}} + +%% sets of numbers + +\def\WholeNumbers{\mbox{$\mathbb W$}} +\def\WholeNums{\mbox{$\mathbb W$}} +\def\Naturals{\mbox{$\mathbb N$}} +\def\NatNums{\mbox{$\mathbb N$}} +\def\natNums{\mbox{$\mathbb N$}} +\def\natNumbers{\mbox{$\mathbb N$}} +\def\NatNumbers{\mbox{$\mathbb N$}} +\def\Reals{\mbox{$\mathbb R$}} +\def\reals{\mbox{$\mathbb R$}} +\def\Jset{\mbox{$\mathbb J$}} +\def\Integers{\mbox{$\mathbb Z$}} +\def\integers{\mbox{$\mathbb Z$}} +\def\rationals{\mbox{$\mathbb Q$}} +\def\Rationals{\mbox{$\mathbb Q$}} + + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% misc 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\def\dom#1{\mbox{dom}} + +\def\seq#1#2{\{#1_{#2}\}_{#2=1}^{\infty}} +\def\seqg#1#2#3{\{#1\}_{#2=#3}^{\infty}} + + +\gdef\makemath#1{\ifmmode #1 \else $ #1 $\fi} +\def\ignore[1]{\relax} +\def\ds{\displaystyle} + +\def\varp{\varphi} + +%\def\options#1#2{ {\tt [ #1 ]/[ #2 ]} } +\def\abs#1{{\mid #1 \mid}} + +% this interferes with the definition in amstheorem and ntheorem +%\def\qed{$\Box$} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% 'function-like' defs +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\def\floor#1{\lfloor #1 \rfloor} +\def\ceiling#1{\lceil #1 \rceil} +\def\falling#1#2{#1^{\underline{#2}}} +\def\rising#1#2{#1^{\overline{#2}}} +\def\pair#1{\langle #1 \rangle } +\def\Pair#1{\left\langle #1 \right\rangle } +\def\tuple#1{{\langle {#1} \rangle}} +\def\length#1{\vert #1 \vert} +\def\boolval#1{\lbrack\!\lbrack #1 \rbrack\!\rbrack} +\def\set#1{\{#1\}} +\def\ctblset#1#2{\set{ #1_{#2} \mid #2 \in \omega }} +\def\card#1{\vert #1 \vert} +\def\size#1{\vert #1 \vert} +\def\norm#1{\| #1 \|} +\def\ket#1{{|{#1} \rangle}} +\def\bra#1{{\langle {#1}|}} +\def\braket#1#2{{\langle {#1} \mid {#2} \rangle}} +%\newcommand{\complement}[1]{\makemath{#1^{c}}} +\newcommand{\comp}[1]{{#1}^{c}} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% symbols for manipulating strings, sets and functions +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\def\garthCases#1{\left\{\mbox{\begin{tabular}{@{$}l@{$\qquad}l} #1 \end{tabular}}\right.} + +%for example: +% +%$$ A = \garthCases{ \sigma & if something,\\ \sigma' & otherwise. 
} $$ +% +% -- garth + +\def\powerset{\mbox{$\cal P$}} +\def\powerSet{\mbox{$\cal P$}} +\def\EmptySet{\emptyset} +\def\emptySet{\emptyset} +\def\emptystring{\lambda} +\def\emptyString{\emptystring} +\def\concat{^\frown} +\def\substring{\sqsubset} +\def\supstring{\sqsupset} +\def\substringeq{\sqsubseteq} +\def\supstringeq{\sqsupseteq} +%\def\substringnoteq{{\sqsubset \atop \not=}} +%\def\supstringnoteq{{\sqsupset \atop \not=}} +%\def\subsetnoteq{{\subset \atop \not=}} +%\def\supsetnoteq{{\supset \atop \not=}} +\def\substringnoteq{\substring} +\def\supstringnoteq{\substring} +\def\subsetnoteq{\subset} +\def\supsetnoteq{\supset} + +\def\intersect{\cap} +\def\Intersect{\bigcap} +\def\union{\cup} +\def\Union{\bigcup} +\def\symdif{\bigtriangleup} +\def\setminus{-} + +\def\compose{\circ} +\def\restricted{\makemath{|\!\grave{\;}}} +\def\restrictedto{{|\!\grave{\;}}} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% logic stuff +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\def\proves{\vdash} +\def\provedby{\dashv} + +\def\implies{\Longrightarrow} + +\def\succ{{\rm succ}} +\def\divg{{\!\uparrow}} +\def\conv{{\!\downarrow}} +\def\domain{{\rm domain}} +\def\range{{\rm range}} + +\def\setof#1{{\left\{{#1}\right\}}} + +\newcommand{\tand}{\mbox{\ and\ }} +\newcommand{\tor}{\mbox{\ or\ }} + diff --git a/StudentGuide/r-interface.jpg b/StudentGuide/r-interface.jpg new file mode 100644 index 0000000..cc903a3 Binary files /dev/null and b/StudentGuide/r-interface.jpg differ diff --git a/StudentGuide/r-markdown.jpg b/StudentGuide/r-markdown.jpg new file mode 100644 index 0000000..f3133bb Binary files /dev/null and b/StudentGuide/r-markdown.jpg differ diff --git a/StudentGuide/rstudio-init.png b/StudentGuide/rstudio-init.png new file mode 100644 index 0000000..559ee3e Binary files /dev/null and b/StudentGuide/rstudio-init.png differ diff --git a/StudentGuide/rstudio-login.png b/StudentGuide/rstudio-login.png new file 
mode 100644 index 0000000..6832de4 Binary files /dev/null and b/StudentGuide/rstudio-login.png differ diff --git a/Traditional/.gitignore b/Traditional/.gitignore deleted file mode 100644 index 1e0c771..0000000 --- a/Traditional/.gitignore +++ /dev/null @@ -1,10 +0,0 @@ -*.aux -*.bbl -*.blg -*.log -*.notes -*.synctex.gz -*-concordance.tex -*.tex -figure -.DS_Store diff --git a/Traditional/Core.Rnw b/Traditional/Core.Rnw deleted file mode 100644 index 8e5911a..0000000 --- a/Traditional/Core.Rnw +++ /dev/null @@ -1,2124 +0,0 @@ -<>= -opts_chunk$set( fig.path="figure/Core-fig-" ) -set_parent('Master-Core.Rnw') -set.seed(123) -@ - - -<>= -require(fastR) -@ - -\chapter{Introduction} - - -In this monograph, we briefly review the commands and functions needed -to analyze data from introductory and second courses in statistics. This is intended to complement -the \emph{Start Teaching with R} and \emph{Start Modeling with R} books that are freely available as part of the -\pkg{mosaic} package. - -Most of our examples will use data from the HELP (Health Evaluation and Linkage to Primary -Care) study: a randomized clinical trial of a novel -way to link at-risk subjects with primary care. More information on the -dataset can be found in chapter \ref{sec:help}. - - -Since the selection and order of topics can vary greatly from -textbook to textbook and instructor to instructor, we have chosen to -organize this material by the kind of data being analyzed. This should make -it straightforward to find what you are looking for even if you present -things in a different order. This is also a good organizational template -to give your students to help them keep straight ``what to do when". - -Some data management is needed by students (and more by instructors). This -material is reviewed in chapter \ref{sec:manipulatingData}. 
- -\myindex{vignettes}% -This work leverages initiatives undertaken by Project MOSAIC (\url{http://www.mosaic-web.org}), an NSF-funded effort to improve the teaching of statistics, calculus, science and computing in the undergraduate curriculum. In particular, we utilize the -\pkg{mosaic} package, -which was written to simplify the use of \R for introductory statistics courses, and the -\pkg{mosaicData} package which includes a number of data sets. -A short summary of the \R\ commands needed to teach introductory statistics can be found in -the mosaic package vignette -(\url{http://cran.r-project.org/web/packages/mosaic/vignettes/mosaic-resources.pdf}). - -Other related resources from Project MOSAIC may be helpful, including an annotated set of examples -from the -sixth edition of -Moore, McCabe and Craig's \emph{Introduction to the Practice of Statistics}\cite{moor:mcca:2007} (see \url{http://www.amherst.edu/~nhorton/ips6e}), -the second and third editions of the \emph{Statistical Sleuth}\cite{Sleuth2} (see \url{http://www.amherst.edu/~nhorton/sleuth}), -and \emph{Statistics: Unlocking the Power of Data} by Lock et al (see \url{https://github.com/rpruim/Lock5withR}). - -\myindex{installing packages}% -\Rindex{install.packages()}% -To use a package within R, it must be installed (one time), and loaded (each session). The -\pkg{mosaic} and \pkg{mosaicData} packages can be installed using the following commands: -<>= -install.packages("mosaic") # note the quotation marks -@ -\TeachingTip{\Rstudio\ features a simplified package installation tab (on the bottom right panel).} -The {\tt \#} character is a comment in R, and all text after that on the -current line is ignored. 
- -\myindex{loading packages}% -\Rindex{require()}% -Once the package is installed (one time only), it can be loaded by running the command: -<>= -require(mosaic) -require(mosaicData) -@ - -\myindex{reproducible analysis}% -\myindex{markdown}% -\myindex{knitr}% -\marginnote{Using Markdown or \pkg{knitr}/\LaTeX\ requires that -the \pkg{markdown} package be installed on your system.}% -The RMarkdown system provides a simple markup language and renders the -results in PDF, Word, or HTML. -This allows students to undertake their analyses using a workflow that -facilitates ``reproducibility'' and avoids cut and paste errors. - -We typically introduce students to RMarkdown very early, -requiring students to use it for assignments and reports\cite{baum:2014}. -\TeachingTip{The \pkg{knitr}/\LaTeX\ system -allows users to combine \R\ and \LaTeX\ in the same document. The -reward for learning this more complicated system is much finer control -over the output format.}% - -Depending on the level of the course, students can use either of these for homework and projects. - - - - - -\chapter{One Quantitative Variable} - -\section{Numerical summaries} - -\R\ includes a number of commands to numerically summarize variables. -These include the capability of calculating the mean, standard deviation, -variance, median, five number summary, interquartile range (IQR) as well as arbitrary quantiles. We will -illustrate these using the CESD (Center for Epidemiologic Studies--Depression) -measure of depressive symptoms (which takes on values between 0 and 60, with higher -scores indicating more depressive symptoms). - -To improve the legibility of output, -we will also set the default number of digits to display to a more reasonable -level (see \function{?options} for more configuration possibilities). 
- -\myindex{HELPrct dataset}% -\Rindex{options()}% -\Rindex{require()}% -\Rindex{mosaic package}% -<>= -require(mosaic) -require(mosaicData) -options(digits=3) -mean(~ cesd, data=HELPrct) -@ - -\myindex{Start Teaching with R@\emph{Start Teaching with R}}% -\myindex{Teaching with R@\emph{Teaching with R}}% -\myindex{Start Modeling with R@\emph{Start Modeling with R}}% -\myindex{Modeling with R@\emph{Modeling with R}}% -Note that the \function{mean()} function in the \pkg{mosaic} package supports a modeling language -common to \pkg{lattice} graphics and linear models (e.g., \function{lm()}). We will use -commands using variants of this modeling language throughout this document. Those already familiar with \R\ may be surprised by the form of this command. -\DiggingDeeper{The \emph{Start Modeling with R} companion book will be helpful if you are unfamiliar with the -modeling language. The \emph{Start Teaching with R} also provides useful guidance in getting started.} - -\Rindex{with()}% -\Rindex{mean()}% -The same output could be -created using the following commands (though we will use the MOSAIC versions when available). -<<>>= -with(HELPrct, mean(cesd)) -mean(HELPrct$cesd) -@ -\Rindex{sd()}% -\Rindex{var()}% -Similar functionality exists for other summary statistics. -<>= -sd(~ cesd, data=HELPrct) -@ -<>= -sd(~ cesd, data=HELPrct)^2 -var(~ cesd, data=HELPrct) -@ - -It is also straightforward to calculate quantiles of the distribution. - -\myindex{quantiles}% -\Rindex{median()}% -<>= -median(~ cesd, data=HELPrct) -@ - -By default, the -\function{quantile()} function displays the quartiles, but can be given -a vector of quantiles to display. -\Rindex{quantile()}% -\Caution{Not all commands have been upgraded to -support the formula interface. 
For these functions, variables within dataframes must be accessed using \function{with()} or the \$ operator.}
-<<>>=
-with(HELPrct, quantile(cesd))
-with(HELPrct, quantile(cesd, c(.025, .975)))
-@
-
-\Rindex{favstats()}%
-Finally, the \function{favstats()}
-function in the \pkg{mosaic} package provides a concise summary of
-many useful statistics.
-<<>>=
-favstats(~ cesd, data=HELPrct)
-@
-
-\section{Graphical summaries}
-The \function{histogram()} function is used to create a histogram.
-Here we use the formula interface (as discussed in the \emph{Start Modeling with R} book) to
-specify that we want a histogram of the CESD scores.
-
-\Rindex{histogram()}%
-\vspace{-4mm}
-\begin{center}
-<<>>=
-histogram(~ cesd, data=HELPrct)
-@
-\end{center}
-
-
-\Rindex{tally()}%
-\Rindex{format option}%
-In the \dataframe{HELPrct} dataset, approximately one quarter of the subjects are female.
-<<>>=
-tally(~ sex, data=HELPrct)
-tally(~ sex, format="percent", data=HELPrct)
-@
-
-It is straightforward to restrict our attention to just the female subjects.
-If we are going to do many things with a subset of our data, it may be easiest
-to make a new dataframe containing only the cases we are interested in.
-The \function{filter()} function in the \pkg{dplyr} package can be used to generate a new dataframe containing
-just the women or just the men (see also section \ref{sec:subsets}). Once this is created,
-the \function{stem()} function is used to create a stem-and-leaf plot.
-\Caution{Note that the tests for equality use \emph{two} equal signs.}
-\Rindex{stem()}%
-\Rindex{filter()}%
-\Rindex{dplyr package}%
-<<>>=
-female <- filter(HELPrct, sex=='female')
-male <- filter(HELPrct, sex=='male')
-with(female, stem(cesd))
-@
-
-\Rindex{dplyr package}%
-\Rindex{tidyr package}%
-
-Subsets can also be generated and used ``on the fly" (this time including
-an overlaid normal density):
-\Rindex{fit option}%
-<<>>=
-histogram(~ cesd, fit="normal",
-  data=filter(HELPrct, sex=='female'))
-@
-
-Alternatively, we can make side-by-side plots to compare multiple subsets.
-<<>>=
-histogram(~ cesd | sex, data=HELPrct)
-@
-
-The layout can be rearranged.
-\Rindex{layout option}%
-\begin{center}
-<<>>=
-histogram(~ cesd | sex, layout=c(1, 2), data=HELPrct)
-@
-\end{center}
-\begin{problem}
-Using the \dataframe{HELPrct} dataset,
-create side-by-side histograms of the CESD scores by substance abuse
-group, just for the male subjects, with an overlaid normal density.
-\end{problem}%
-\begin{solution}
-<<>>=
-histogram(~ cesd | substance, fit="normal",
-  data=filter(HELPrct, sex=='male'))
-@
-\end{solution}%
-We can control the number of bins in a number of ways. Here we specify
-the total number of bins.
-\Rindex{nint option}%
-\begin{center}
-<<>>=
-histogram(~ cesd, nint=20, data=female)
-@
-\end{center}
-The width of the bins can also be specified.
-\Rindex{width option}%
-\begin{center}
-<<>>=
-histogram(~ cesd, width=1, data=female)
-@
-\end{center}
-
-The \function{dotPlot()} function is used to create a dotplot
-for a smaller subset of subjects (homeless females). We also demonstrate
-how to change the x-axis label.
-\Rindex{dotPlot()}%
-<<>>=
-dotPlot(~ cesd, xlab="CESD score",
-  data=filter(HELPrct, (sex=="female") & (homeless=="homeless")))
-@
-
-
-\section{Density curves}
-
-\FoodForThought{Density plots are also sensitive to certain choices.
If your density plot
-is too jagged or too smooth, try adjusting the \option{adjust} argument (larger than 1 for
-smoother plots, less than 1 for more jagged plots).}
-One disadvantage of histograms is that they can be sensitive to the choice of the
-number of bins. Another display to consider is a density curve.
-
-Here we adorn a density plot with some gratuitous additions to
-demonstrate how to build up a graphic for pedagogical purposes.
-We add some text, a superimposed normal density, and
-a vertical line. A variety of line types and colors can be specified,
-as well as line widths.
-
-\DiggingDeeper{The \function{plotFun()} function can also be used to annotate plots (see
-section \ref{sec:plotFun}).}
-\begin{center}
-\Rindex{densityplot()}%
-\Rindex{ladd()}%
-\Rindex{panel.mathdensity()}%
-\Rindex{panel.abline()}%
-\Rindex{col option}%
-\Rindex{grid.text()}%
-\Rindex{lty option}%
-\Rindex{lwd option}%
-<<>>=
-densityplot(~ cesd, data=female)
-ladd(grid.text(x=0.2, y=0.8, 'only females'))
-ladd(panel.mathdensity(args=list(mean=mean(cesd),
-  sd=sd(cesd)), col="red"), data=female)
-ladd(panel.abline(v=60, lty=2, lwd=2, col="grey"))
-@
-\end{center}
-
-\section{Frequency polygons}
-\myindex{polygons}%
-
-A third option is a frequency polygon, where the graph is created by joining the midpoints of the tops of the bars of a histogram.
-\Rindex{freqpolygon()}%
-\begin{center}
-<<>>=
-freqpolygon(~ cesd, data=female)
-@
-\end{center}
-
-
-\section{Normal distributions}
-
-\FoodForThought{\code{x} is for eXtra.}%
-The most famous density curve is a normal distribution. The \function{xpnorm()} function
-displays the probability that a random variable is less than the first argument, for a
-normal distribution with mean given by the second argument and standard deviation by the
-third.
-More information about probability distributions can
-be found in section \ref{sec:probability}.
-
-\begin{center}
-<<>>=
-xpnorm(1.96, mean=0, sd=1)
-@
-\end{center}
-
-\section{Inference for a single sample}
-\label{sec:bootstrapsing}
-
-\Rindex{t.test()}%
-\Rindex{confint()}%
-
-We can calculate a 95\% confidence interval for the mean CESD
-score for females by using a t-test:
-<<>>=
-t.test(~ cesd, data=female)
-confint(t.test(~ cesd, data=female))
-@
-
-\DiggingDeeper{More details and examples can be found in the
-\pkg{mosaic} package Resampling Vignette.}
-\myindex{bootstrapping}%
-\myindex{resampling}%
-But it's also straightforward to calculate this using a bootstrap.
-The statistic that we want to resample is the mean.
-<<>>=
-mean(~ cesd, data=female)
-@
-
-One resampling trial can be carried out:
-\TeachingTip{Here we sample with replacement from the original dataframe,
-creating a resampled dataframe with the same number of rows.}
-\Rindex{resample()}%
-<<>>=
-mean(~ cesd, data=resample(female))
-@
-\TeachingTip{Even though a single trial is of little use, it's smart to have
-students do the calculation to show that they are (usually!) getting a different
-result than without resampling.}
-
-Another will yield different results:
-<<>>=
-mean(~ cesd, data=resample(female))
-@
-
-Now conduct 1000 resampling trials, saving the results in an object
-called \texttt{trials}:
-\Rindex{do()}%
-\Rindex{qdata()}%
-<<>>=
-trials <- do(1000) * mean(~ cesd, data=resample(female))
-qdata(c(.025, .975), ~ result, data=trials)
-@
-
-\chapter{One Categorical Variable}
-
-\section{Numerical summaries}
-
-\myindex{categorical variables}%
-\myindex{contingency tables}%
-\myindex{tables}%
-The \function{tally()} function can be used to calculate
-counts, percentages, and proportions for a categorical variable.
-
-\Rindex{tally()}%
-\Rindex{margins option}%
-<<>>=
-tally(~ homeless, data=HELPrct)
-tally(~ homeless, margins=TRUE, data=HELPrct)
-tally(~ homeless, format="percent", data=HELPrct)
-tally(~ homeless, format="proportion", data=HELPrct)
-@
-\DiggingDeeper{The \emph{Start Modeling with R} companion book will be helpful if you are unfamiliar with the
-modeling language. The \emph{Start Teaching with R} book also provides useful guidance in getting started.}
-
-\section{The binomial test}
-
-\myindex{binomial test}%
-\Rindex{binom.test()}%
-An exact confidence interval for a proportion (as well as a test of the null
-hypothesis that the population proportion is equal to a particular value [by default 0.5]) can be calculated
-using the \function{binom.test()} function.
-The standard \function{binom.test()} requires us to tabulate the counts first.
-<<>>=
-binom.test(209, 209 + 244)
-@
-The \pkg{mosaic} package provides a formula interface that avoids the need to pre-tally
-the data.
-<<>>=
-result <- binom.test(~ (homeless=="homeless"), data=HELPrct)
-result
-@
-
-As is generally the case with commands of this sort,
-there are a number of useful quantities available from
-the object returned by the function.
-<<>>=
-names(result)
-@
-These can be extracted using the {\tt \$} operator or an extractor function.
-For example, the user can extract the confidence interval or p-value.
-\Rindex{confint()}%
-\Rindex{pval()}%
-\Rindex{print()}%
-<<>>=
-result$statistic
-confint(result)
-pval(result)
-@
-\DiggingDeeper{Most of the objects in \R\ have a \function{print()}
-method. So when we get \code{result}, what we are seeing displayed in the console is
-\code{print(result)}. There may be a good deal of additional information
-lurking inside the object itself. To make matters even more complicated, some
-objects are returned \emph{invisibly}, so nothing prints. You can still assign
-the returned object to a variable and process it later, even if nothing shows up
-on the screen.
This is sometimes helpful for \pkg{lattice} graphics functions.}%
-
-
-\section{The proportion test}
-
-A similar interval and test can be calculated using \function{prop.test()}.
-\Rindex{prop.test()}%
-\Rindex{correct option}%
-<<>>=
-tally(~ homeless, data=HELPrct)
-prop.test(~ (homeless=="homeless"), correct=FALSE, data=HELPrct)
-@
-It also accepts summarized data, the way \function{binom.test()} does.
-\InstructorNote{\function{prop.test()} calculates a Chi-squared statistic.
-Most introductory texts use a $z$-statistic. They are mathematically equivalent
-in terms of inferential statements, but
-you may need to address the discrepancy with your students.}%
-<<>>=
-prop.test(209, 209 + 244, correct=FALSE)
-@
-
-\section{Goodness of fit tests}
-
-A variety of goodness-of-fit tests can be calculated against a reference
-distribution. For the HELP data, we could test the null hypothesis that there is an equal
-proportion of subjects in each substance abuse group in the population from which the sample was drawn.
-
-\Caution{The \option{margins=FALSE} option is the default for the \function{tally()} function.}
-<<>>=
-tally(~ substance, format="percent", data=HELPrct)
-observed <- tally(~ substance, data=HELPrct)
-observed
-@
-\Rindex{chisq.test()}%
-<<>>=
-p <- c(1/3, 1/3, 1/3)   # equivalent to rep(1/3, 3)
-chisq.test(observed, p=p)
-total <- sum(observed); total
-expected <- total*p; expected
-@
-
-We can also calculate the $\chi^2$ statistic manually, as a function of observed and expected values.
-
-\TeachingTip{We don't have students do many (if any) manual calculations in our courses.}%
-\Rindex{sum()}%
-\Rindex{pchisq()}%
-<<>>=
-chisq <- sum((observed - expected)^2/expected); chisq
-1 - pchisq(chisq, df=2)
-@
-\TeachingTip{The \function{pchisq()} function calculates the probability that a $\chi^2$ random variable with \option{df} degrees of freedom is less than or equal to a given value.
Here we calculate the complement to find the area to the right of the observed Chi-square statistic.}%
-
-Alternatively, the \pkg{mosaic} package provides a version of \function{chisq.test()} with
-more verbose output.
-\FoodForThought{\code{x} is for eXtra.}
-<<>>=
-xchisq.test(observed, p=p)
-@
-\TeachingTip{Objects in the workspace that are no longer needed can be removed.}
-<<>>=
-# clean up variables no longer needed
-rm(observed, p, total, chisq)
-@
-
-
-\chapter{Two Quantitative Variables}
-
-\section{Scatterplots}
-\myindex{scatterplots}%
-\myindex{lowess}%
-\myindex{smoothers}%
-\myindex{linearity}%
-
-We always encourage students to start any analysis by graphing their data.
-Here we augment a scatterplot
-of the CESD (a measure of depressive symptoms, where higher scores indicate more symptoms) and the MCS (mental component score from the SF-36, where higher scores indicate better functioning) for female subjects
-with a lowess (locally weighted scatterplot smoother) line, using a circle
-as the plotting character and a slightly thicker line.
-
-\InstructorNote{The lowess line can help to assess the linearity of a relationship. It is added by specifying both points (using `p') and a lowess smoother.}
-\Rindex{xyplot()}%
-\Rindex{pch option}%
-\Rindex{cex option}%
-\Rindex{lwd option}%
-\Rindex{type option}%
-\begin{center}
-<<>>=
-females <- filter(HELPrct, female==1)
-xyplot(cesd ~ mcs, type=c("p","smooth"), pch=1, cex=0.6,
-  lwd=3, data=females)
-@
-\end{center}
-\DiggingDeeper{The \emph{Start Modeling with R} companion book will be helpful if you are unfamiliar with the
-modeling language. The \emph{Start Teaching with R} book also provides useful guidance in getting started.}
-
-It's straightforward to plot something besides a character in a scatterplot.
-In this example, the \dataframe{USArrests} dataset can be used to plot the association
-between murder and assault rates, with the state name displayed. This
-requires a panel function to be written.
-
-\Rindex{function()}%
-\Rindex{panel.labels()}%
-\Rindex{panel.text()}%
-\Rindex{rownames()}%
-<<>>=
-panel.labels <- function(x, y, labels='x', ...) {
-  panel.text(x, y, labels, cex=0.4, ...)
-}
-xyplot(Murder ~ Assault, panel=panel.labels,
-  labels=rownames(USArrests), data=USArrests)
-@
-
-
-
-\section{Correlation}
-
-Correlations can be calculated for a pair of variables, or for a matrix of variables.
-\myindex{correlation}%
-\Rindex{cor()}%
-<<>>=
-cor(cesd, mcs, data=females)
-smallHELP <- select(females, cesd, mcs, pcs)
-cor(smallHELP)
-@
-\myindex{Pearson correlation}%
-\myindex{Spearman correlation}%
-
-By default, Pearson correlations are provided. Other variants (e.g., Spearman) can be specified using the
-\option{method} option.
-<<>>=
-cor(cesd, mcs, method="spearman", data=females)
-@
-
-\section{Pairs plots}
-\myindex{pairs plot}%
-\myindex{scatterplot matrix}%
-
-A pairs plot (scatterplot matrix) can be created for each pair of a set of variables.
-\TeachingTip{The \pkg{GGally} package has support for more elaborate pairs plots.}
-\Rindex{splom()}%
-<<>>=
-splom(smallHELP)
-@
-
-\section{Simple linear regression}
-
-\InstructorNote{We tend to introduce linear regression
-early in our courses, as a purely descriptive technique.}
-\myindex{linear regression}%
-\myindex{regression}%
-
-Linear regression models are described in detail in \emph{Start Modeling with R}.
-These use the same formula interface introduced previously for numerical and graphical
-summaries
-to specify the outcome
-and predictors. Here we consider fitting the model \model{\variable{cesd}}{\variable{mcs}}.
-
-
-\Rindex{lm()}%
-\Rindex{coef()}%
-<<>>=
-cesdmodel <- lm(cesd ~ mcs, data=females)
-coef(cesdmodel)
-@
-\InstructorNote{It's important to pick good names for modeling
-objects.
Here the output of \function{lm()} is saved as \code{cesdmodel},
-which denotes that it is a regression model of depressive symptom
-scores.}
-
-To simplify the output, we turn off the option to display significance stars.
-\myindex{significance stars}%
-\Rindex{summary()}%
-\Rindex{confint()}%
-\Rindex{rsquared()}%
-\Rindex{coef()}%
-<<>>=
-options(show.signif.stars=FALSE)
-coef(cesdmodel)
-summary(cesdmodel)
-coef(summary(cesdmodel))
-confint(cesdmodel)
-rsquared(cesdmodel)
-@
-
-
-\Rindex{class()}%
-<<>>=
-class(cesdmodel)
-@
-The return value from \function{lm()} is a linear model object.
-A number of functions can operate on these objects, as
-seen previously with \function{coef()}.
-The function \function{residuals()} returns a
-vector of the residuals.
-\Rindex{residuals()}%
-\FoodForThought{The function \function{residuals()} can be abbreviated
-\function{resid()}. Another useful function is \function{fitted()}, which
-returns a vector of predicted values.}
-
-
-\Rindex{density option}%
-\begin{center}
-<<>>=
-histogram(~ residuals(cesdmodel), density=TRUE)
-@
-\end{center}
-\Rindex{qqmath()}%
-\begin{center}
-<<>>=
-qqmath(~ resid(cesdmodel))
-@
-\end{center}
-\Rindex{alpha option}%
-\begin{center}
-<<>>=
-xyplot(resid(cesdmodel) ~ fitted(cesdmodel),
-  type=c("p", "smooth", "r"), alpha=0.5, cex=0.3, pch=20)
-@
-\end{center}
-
-The \function{mplot()} function can facilitate creating a variety of useful plots, including the same residuals vs.\ fitted scatterplots, by specifying the \option{which=1} option.
-
-\Rindex{mplot()}%
-\Rindex{which option}%
-<<>>=
-mplot(cesdmodel, which=1)
-@
-
-It can also generate a
-normal quantile-quantile plot (\option{which=2}),
-<<>>=
-mplot(cesdmodel, which=2)
-@
-
-\myindex{scale versus location}%
-scale vs.\,location,
-<<>>=
-mplot(cesdmodel, which=3)
-@
-
-\myindex{Cook's distance}%
-Cook's distance by observation number,
-<<>>=
-mplot(cesdmodel, which=4)
-@
-
-\myindex{leverage}%
-residuals vs.\,leverage, and
-<<>>=
-mplot(cesdmodel, which=5)
-@
-
-Cook's distance vs.\,leverage.
-<<>>=
-mplot(cesdmodel, which=6)
-@
-
-\myindex{prediction bands}%
-\Rindex{panel.lmbands()}%
-\Rindex{band.lwd option}%
-Prediction bands can be added to a plot using the \function{panel.lmbands()} function.
-\begin{center}
-<<>>=
-xyplot(cesd ~ mcs, panel=panel.lmbands, cex=0.2, band.lwd=2, data=HELPrct)
-@
-\end{center}
-
-\begin{problem}
-Using the \dataframe{HELPrct} dataset, fit a simple linear regression model
-predicting the number of drinks per day as a function of the mental
-component score.
-This model can be specified using the formula:
-\model{\variable{i1}}{\variable{mcs}}.
-Assess the distribution of the residuals for this model.
-\end{problem}
-
-
-\chapter{Two Categorical Variables}
-
-
-\section{Cross classification tables}
-\label{sec:cross}
-
-\myindex{cross classification tables}%
-\myindex{contingency tables}%
-\myindex{tables}%
-
-Cross classification (two-way or $R$ by $C$) tables can be constructed for
-two (or more) categorical variables. Here we consider the contingency table
-for homeless status (homeless one or more nights in the past 6 months or housed)
-and sex.
-
-<<>>=
-tally(~ homeless + sex, margins=FALSE, data=HELPrct)
-@
-
-We can also calculate column percentages:
-\Rindex{tally()}%
-<<>>=
-tally(~ sex | homeless, margins=TRUE, format="percent",
-  data=HELPrct)
-@
-
-We can calculate the odds ratio directly from the table:
-<<>>=
-OR <- (40/169)/(67/177); OR
-@
-
-The
-\pkg{mosaic} package has a function which will calculate odds ratios:
-\Rindex{oddsRatio()}%
-<<>>=
-oddsRatio(tally(~ (homeless=="housed") + sex, margins=FALSE,
-  data=HELPrct))
-@
-
-Graphical summaries of cross classification tables may be helpful in visualizing
-associations. Mosaic plots are one example, where the area of each cell is proportional to the
-number of observations it contains (with the total area scaled to one).
-\Caution{The jury is still out
-regarding the utility of mosaic plots, given their low data-to-ink ratio\cite{Tufte:2001:Visual}. But we have found them to be helpful in reinforcing understanding of a two-way contingency table.}%
-Here we see that males tend to be over-represented
-amongst the homeless subjects (as represented by the horizontal line, which is higher for
-the homeless than for the housed).
-\FoodForThought{The \function{mosaic()} function
-in the \pkg{vcd} package also makes mosaic plots.}
-\Rindex{mosaicplot()}%
-\begin{center}
-<<>>=
-mytab <- tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct)
-mosaicplot(mytab)
-@
-\end{center}
-
-\section{Chi-squared tests}
-
-\Rindex{chisq.test()}%
-<<>>=
-chisq.test(tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct), correct=FALSE)
-@
-
-A statistically significant association is found: it is unlikely that we would observe
-an association this strong if homeless status and sex were independent in the
-population.
-
-When a student finds a significant association,
-it's important for them to be able to interpret this in the context of the problem.
-The \function{xchisq.test()} function provides additional details (observed, expected, contribution to statistic, and residual) to help with this process.
-
-\FoodForThought{\code{x} is for eXtra.}
-
-\Rindex{xchisq.test()}%
-<<>>=
-xchisq.test(tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct), correct=FALSE)
-@
-
-We observe that there are fewer homeless women and more homeless men than would be expected.
-
-\section{Fisher's exact test}
-\myindex{Fisher's exact test}%
-
-An exact test can also be calculated. This is computationally straightforward for 2 by 2
-tables. Options to help constrain the size of the problem for larger tables exist
-(see \verb!?fisher.test!).
-
-\DiggingDeeper{Note the different estimate of the odds ratio from that seen in section \ref{sec:cross}.
-The \function{fisher.test()} function uses a different estimator (and a different interval, based
-on the profile likelihood).}
-\Rindex{fisher.test()}%
-<<>>=
-fisher.test(tally(~ homeless + sex, margins=FALSE,
-  data=HELPrct))
-@
-
-\chapter{Quantitative Response to a Categorical Predictor}
-
-\section{A dichotomous predictor: numerical and graphical summaries}
-Here we will compare the distributions of CESD scores by sex.
-
-The \function{mean()} function can be used to calculate the mean CESD score
-separately for males and females.
-<<>>=
-mean(cesd ~ sex, data=HELPrct)
-@
-
-The \function{favstats()} function can provide more statistics by group.
-<<>>=
-favstats(cesd ~ sex, data=HELPrct)
-@
-
-
-Boxplots are a particularly helpful graphical display to compare distributions.
-The \function{bwplot()} function can be used to display the boxplots for the
-CESD scores separately by sex. We see from both the numerical and graphical
-summaries that women tend to have slightly higher CESD scores than men.
-
-\FoodForThought{Although we usually put explanatory variables along the horizontal axis,
-page layout sometimes makes the other orientation preferable for these plots.}
-%\vspace{-8mm}
-\begin{center}
-<<>>=
-bwplot(sex ~ cesd, data=HELPrct)
-@
-\end{center}
-
-When sample sizes are small, there is no reason to summarize with a boxplot,
-since \function{xyplot()} can handle categorical predictors.
-Even with 10--20 observations in a group, a scatterplot is often quite readable.
-Setting the alpha level helps detect multiple observations with the same value.
-\FoodForThought{One of us once saw a biologist proudly present
-side-by-side boxplots. Thinking a major victory had been won, he naively
-asked how many observations were in each group. ``Four,'' replied the
-biologist.}
-\begin{center}
-<<>>=
-xyplot(sex ~ length, data=KidsFeet, alpha=.6, cex=1.4)
-@
-\end{center}
-
-\section{A dichotomous predictor: two-sample t}
-
-Student's two-sample t-test can be run without (the default) or with an equal variance assumption.
-<<>>=
-t.test(cesd ~ sex, var.equal=FALSE, data=HELPrct)
-@
-We see that there is a statistically significant difference between the two groups.
-
-We can repeat using the equal variance assumption.
-<<>>=
-t.test(cesd ~ sex, var.equal=TRUE, data=HELPrct)
-@
-
-The groups can also be compared using the \function{lm()} function (also with an equal variance assumption).
-<<>>=
-summary(lm(cesd ~ sex, data=HELPrct))
-@
-
-\TeachingTip{While it requires use of the equal variance assumption, the \function{lm()} function is part of a much more flexible modeling framework (while \function{t.test()} is essentially a dead end).}%
-
-
-\section{Non-parametric two-group tests}
-
-The same conclusion is reached using a non-parametric (Wilcoxon rank sum) test.
-
-<<>>=
-wilcox.test(cesd ~ sex, data=HELPrct)
-@
-
-
-\section{Permutation test}
-\myindex{resampling}%
-\myindex{permutation test}%
-
-
-Here we extend the methods introduced in section \ref{sec:bootstrapsing} to
-undertake a two-sided test comparing the ages at baseline by sex. First we calculate the observed difference in means:
-\Rindex{diffmean()}%
-\Rindex{shuffle()}%
-<<>>=
-mean(age ~ sex, data=HELPrct)
-test.stat <- diffmean(age ~ sex, data=HELPrct)
-test.stat
-@
-We can calculate the same statistic after shuffling the group labels:
-<<>>=
-do(1) * diffmean(age ~ shuffle(sex), data=HELPrct)
-do(1) * diffmean(age ~ shuffle(sex), data=HELPrct)
-do(3) * diffmean(age ~ shuffle(sex), data=HELPrct)
-@
-
-\DiggingDeeper{More details and examples can be found in the
-\pkg{mosaic} package Resampling Vignette.}
-\Rindex{xlim option}%
-\Rindex{groups option}%
-<<>>=
-rtest.stats <- do(500) * diffmean(age ~ shuffle(sex),
-  data=HELPrct)
-favstats(~ diffmean, data=rtest.stats)
-histogram(~ diffmean, n=40, xlim=c(-6, 6),
-  groups=diffmean >= test.stat, pch=16, cex=.8,
-  data=rtest.stats)
-ladd(panel.abline(v=test.stat, lwd=3, col="red"))
-@
-
-Here we don't see much evidence to contradict the null hypothesis that men and
-women
-have the same mean age in the population.
-
-\section{One-way ANOVA}
-\myindex{one-way ANOVA}%
-\myindex{analysis of variance}%
-
-Earlier comparisons were between two groups. We can also consider testing differences between
-three or more groups using one-way ANOVA. Here we compare
-CESD scores by primary substance of abuse (heroin, cocaine, or alcohol).
-
-\Rindex{bwplot()}%
-\begin{center}
-<<>>=
-bwplot(cesd ~ substance, data=HELPrct)
-@
-\end{center}
-
-
-<<>>=
-mean(cesd ~ substance, data=HELPrct)
-@
-\Rindex{aov()}%
-<<>>=
-anovamod <- aov(cesd ~ substance, data=HELPrct)
-summary(anovamod)
-@
-While still high (scores of 16 or more are generally considered to be
-``severe'' symptoms), the cocaine-involved group tends to have lower
-scores than those whose primary substances are alcohol or heroin.
-<<>>=
-modintercept <- lm(cesd ~ 1, data=HELPrct)
-modsubstance <- lm(cesd ~ substance, data=HELPrct)
-@
-
-The \function{anova()} command can summarize models.
-\Rindex{anova()}%
-<<>>=
-anova(modsubstance)
-@
-
-It can also be used to formally
-compare two (nested) models.
-\myindex{model comparison}%
-<<>>=
-anova(modintercept, modsubstance)
-@
-
-
-\section{Tukey's Honest Significant Differences}
-\myindex{Tukey's HSD}%
-\myindex{honest significant differences}%
-\myindex{multiple comparisons}%
-
-There are a variety of multiple comparison procedures that can be
-used after fitting an ANOVA model. One of these is Tukey's Honest
-Significant Differences (HSD). Other options are available within the
-\pkg{multcomp} package.
-
-<<>>=
-favstats(cesd ~ substance, data=HELPrct)
-@
-\Rindex{TukeyHSD()}%
-\Rindex{factor()}%
-\Rindex{levels option}%
-\Rindex{labels option}%
-\Rindex{mutate()}%
-\Rindex{lm()}%
-<<>>=
-HELPrct <- mutate(HELPrct, subgrp = factor(substance,
-  levels=c("alcohol", "cocaine", "heroin"),
-  labels=c("A", "C", "H")))
-mod <- lm(cesd ~ subgrp, data=HELPrct)
-HELPHSD <- TukeyHSD(mod, "subgrp")
-HELPHSD
-@
-\Rindex{mplot()}%
-<<>>=
-mplot(HELPHSD)
-@
-
-Again, we see that the cocaine group has significantly lower CESD scores
-than either of the other two groups.
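-
-Instructors who want to stay in base \R\ can obtain a similar set of pairwise
-comparisons with the \function{pairwise.t.test()} function, which adjusts the
-p-values for multiple testing (here with the Holm correction). This is offered
-as a minimal sketch of an alternative, not as part of our usual workflow.
-<<>>=
-# pairwise t-tests between substance groups, Holm-adjusted p-values
-with(HELPrct, pairwise.t.test(cesd, substance, p.adjust.method="holm"))
-@
-The conclusions should be qualitatively similar to those from Tukey's HSD,
-though the two procedures adjust for multiplicity in different ways.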
-
-\chapter{Categorical Response to a Quantitative Predictor}
-
-\section{Logistic regression}
-\myindex{logistic regression}%
-
-Logistic regression is available using the \function{glm()} function,
-which supports
-a variety of
-link functions and distributional forms for generalized linear models, including logistic regression.
-\FoodForThought{The \function{glm()} function has an argument \option{family}, which can take an option
-\option{link}. The \code{logit} link is the default link for the binomial family,
-so we don't need to specify it here. The more verbose usage would be \code{family=binomial(link=logit)}.}%
-\Rindex{glm()}%
-\Rindex{family option}%
-\Rindex{exp()}%
-<<>>=
-logitmod <- glm(homeless ~ age + female, family=binomial,
-  data=HELPrct)
-summary(logitmod)
-exp(coef(logitmod))
-exp(confint(logitmod))
-@
-
-We can compare two models (for multiple degree of freedom tests). For example, we
-might be interested in the association of homeless status and age for each of the three substance groups.
-\Rindex{anova()}%
-\Rindex{test option}%
-<<>>=
-mymodsubage <- glm((homeless=="homeless") ~ age + substance,
-  family=binomial, data=HELPrct)
-mymodage <- glm((homeless=="homeless") ~ age, family=binomial,
-  data=HELPrct)
-summary(mymodsubage)
-exp(coef(mymodsubage))
-anova(mymodage, mymodsubage, test="Chisq")
-@
-We observe that the cocaine and heroin groups are significantly less likely to be homeless than alcohol-involved subjects, after controlling for age. (A similar result is seen when considering just homeless status and substance.)
-
-<<>>=
-tally(~ homeless | substance, format="percent", margins=TRUE, data=HELPrct)
-@
-
-\chapter{Survival Time Outcomes}
-
-\myindex{survival analysis}%
-\myindex{failure time analysis}%
-\myindex{time to event analysis}%
-Extensive support for survival (time to event) analysis is available within the
-\pkg{survival} package.
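-
-Before fitting survival models, it may help students to see how a censored
-outcome is represented. As a minimal sketch (not part of the analyses that
-follow), the \function{Surv()} function pairs each time with an event
-indicator; a trailing \code{+} in the printed output marks a censored
-observation.
-\Rindex{Surv()}%
-<<>>=
-require(survival)
-# each entry pairs a time with an event indicator; "+" marks censoring
-with(HELPrct, head(Surv(dayslink, linkstatus)))
-@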
-
-\section{Kaplan-Meier plot}
-
-\myindex{Kaplan-Meier plot}%
-\Rindex{survfit()}%
-\Rindex{Surv()}%
-\Rindex{conf.int option}%
-\Rindex{xlab option}%
-\begin{center}
-<<>>=
-require(survival)
-fit <- survfit(Surv(dayslink, linkstatus) ~ treat,
-  data=HELPrct)
-plot(fit, conf.int=FALSE, lty=1:2, lwd=2,
-  xlab="time (in days)", ylab="P(not linked)")
-legend(20, 0.4, legend=c("Control", "Treatment"),
-  lty=c(1,2), lwd=2)
-title("Product-Limit Survival Estimates (time to linkage)")
-@
-\end{center}
-
-We see that the subjects in the treatment group (Health Evaluation and Linkage to Primary Care clinic) were significantly more likely to
-link to primary care (and less likely to ``survive'') than the control (usual care) group.
-
-\section{Cox proportional hazards model}
-\myindex{Cox proportional hazards model}%
-\myindex{proportional hazards model}%
-\Rindex{coxph()}%
-
-<<>>=
-require(survival)
-summary(coxph(Surv(dayslink, linkstatus) ~ age + substance,
-  data=HELPrct))
-@
-
-Neither age nor substance group was significantly associated with linkage to primary care.
-
-
-\chapter{More than Two Variables}
-
-\section{Two (or more) way ANOVA}
-
-We can fit a two (or more) way ANOVA model, without or with an interaction,
-using the same modeling syntax.
-<<>>=
-median(cesd ~ substance | sex, data=HELPrct)
-bwplot(cesd ~ subgrp | sex, data=HELPrct)
-@
-<<>>=
-summary(aov(cesd ~ substance + sex, data=HELPrct))
-@
-<<>>=
-summary(aov(cesd ~ substance * sex, data=HELPrct))
-@
-There's little evidence for the interaction, though there are statistically
-significant main effect terms for \variable{substance} group and
-\variable{sex}.
-
-<<>>=
-xyplot(cesd ~ substance, groups=sex,
-  auto.key=list(columns=2, lines=TRUE, points=FALSE), type='a',
-  data=HELPrct)
-@
-\Rindex{auto.key option}%
-
-
-\section{Multiple regression}
-\myindex{multiple regression}%
-\myindex{multivariate relationships}%
-
-Multiple regression is a logical extension of the prior commands, where
-additional predictors are added. This allows students to begin to disentangle
-multivariate relationships.
-
-\InstructorNote{We tend to introduce multiple linear regression
-early in our courses, as a purely descriptive technique, then return to it
-regularly. The motivation for this is described at length in the companion volume
-\emph{Start Modeling with R}.}
-
-Here we consider a (parallel slopes) model for depressive symptoms as a function of Mental Component Score (MCS),
-age (in years), and sex of the subject.
-<<>>=
-lmnointeract <- lm(cesd ~ mcs + age + sex, data=HELPrct)
-summary(lmnointeract)
-@
-\myindex{interactions}%
-We can also fit a model that includes an interaction between MCS and sex.
-<<>>=
-lminteract <- lm(cesd ~ mcs + age + sex + mcs:sex, data=HELPrct)
-summary(lminteract)
-anova(lminteract)
-@
-<<>>=
-anova(lmnointeract, lminteract)
-@
-
-There is little evidence for an interaction effect, so we drop
-this from the model.
-
-\subsection{Visualizing the results from the regression}
-\label{sec:plotFun}
-
-\Rindex{plotFun()}%
-\Rindex{makeFun()}%
-The \function{makeFun()} and \function{plotFun()} functions from the \pkg{mosaic} package
-can be used to display the results from a regression model. For this example, we might
-display the predicted CESD values for a range of MCS values for 36-year-old male and female subjects from the parallel
-slopes (no interaction) model.
-<<>>=
-lmfunction <- makeFun(lmnointeract)
-@
-
-\Rindex{xyplot()}%
-\Rindex{auto.key option}%
-\Rindex{ylab option}%
-\Rindex{groups option}%
-\Rindex{add option}%
-We can now plot this function for male and female subjects over a range of MCS (mental component score) values, along
-with the observed data for 36-year-olds.
-<<>>=
-xyplot(cesd ~ mcs, groups=sex, auto.key=TRUE,
-  data=filter(HELPrct, age==36))
-plotFun(lmfunction(mcs, age=36, sex="male") ~ mcs,
-  xlim=c(0, 60), lwd=2, ylab="predicted CESD", add=TRUE)
-plotFun(lmfunction(mcs, age=36, sex="female") ~ mcs,
-  xlim=c(0, 60), lty=2, lwd=3, add=TRUE)
-@
-
-
-\subsection{Coefficient plots}
-
-\myindex{coefficient plots}%
-It is sometimes useful to display a plot of the coefficients for a multiple regression model (along with their associated
-confidence intervals).
-
-\Rindex{mplot()}%
-<<>>=
-mplot(lmnointeract, rows=-1, which=7)
-@
-
-\TeachingTip{Darker dots indicate regression coefficients where the 95\% confidence interval does not include the null hypothesis value of zero.}
-
-\Caution{Be careful when fitting regression models with missing values (see also section \ref{sec:miss}).}
-
-\subsection{Residual diagnostics}
-\myindex{residual diagnostics}%
-\myindex{regression diagnostics}%
-
-It's straightforward to undertake residual diagnostics for this model. We begin by adding the
-fitted values and residuals to the dataset.
-\TeachingTip{The \function{mplot()} function can also be used to create these graphs.}
-\Rindex{resid()}%
-\Rindex{fitted()}%
-\Rindex{abs()}%
-\InstructorNote{Here we are adding two new variables to an existing dataset. It's often good practice to give the resulting dataframe a new name.}
-<<>>=
-HELPrct <- mutate(HELPrct, residuals = resid(lmnointeract),
-  pred = fitted(lmnointeract))
-@
-<<>>=
-histogram(~ residuals, xlab="residuals", fit="normal",
-  data=HELPrct)
-@
-
-We can identify the subset of observations with extremely large residuals.
-
-\Rindex{abs()}%
-<<>>=
-filter(HELPrct, abs(residuals) > 25)
-@
-
-\Rindex{cex option}%
-\Rindex{type option}%
-<<>>=
-xyplot(residuals ~ pred, ylab="residuals", cex=0.3,
-  xlab="predicted values", main="residuals vs. predicted values",
-  type=c("p", "r", "smooth"), data=HELPrct)
-@
-<<>>=
-xyplot(residuals ~ mcs, xlab="mental component score",
-  ylab="residuals", cex=0.3,
-  type=c("p", "r", "smooth"), data=HELPrct)
-@
-
-The assumptions of normality, linearity and homoscedasticity seem reasonable here.
-\begin{problem}
-The \dataframe{RailTrail} dataset within the \pkg{mosaic} package includes the counts
-of crossings of a rail trail in Northampton, Massachusetts for 90 days in 2005.
-City officials are interested in understanding usage of the trail network, and
-how it changes as a function of temperature and day of the week.
-Describe the distribution of the variable \variable{avgtemp} in terms of its
-center, spread and shape.
-<<>>=
-favstats(~ avgtemp, data=RailTrail)
-densityplot(~ avgtemp, xlab="Average daily temp (degrees F)",
-  data=RailTrail)
-@
-\end{problem}
-\begin{solution}
-The distribution of average temperature (in degrees Fahrenheit) is approximately normally
-distributed with mean 57.4 degrees and standard deviation of 11.3 degrees.
-\end{solution}
-\begin{problem}
-The \dataframe{RailTrail} dataset also includes a variable called \variable{cloudcover}.
-Describe the distribution of this variable in terms of its
-center, spread and shape.
-\end{problem}
-\begin{solution}
-<<>>=
-favstats(~ cloudcover, data=RailTrail)
-densityplot(~ cloudcover, data=RailTrail)
-@
-The distribution of cloud cover is ungainly (almost triangular), with increasing probability for more
-cloud cover. The mean is 5.8 oktas (out of 10), with standard deviation of 3.2 oktas. It tends to be
-cloudy in Northampton!
-\end{solution}
-\begin{problem}
-The variable in the \dataframe{RailTrail} dataset that provides the daily count
-of crossings is called \variable{volume}.
-Describe the distribution of this variable in terms of its
-center, spread and shape.
-\end{problem}
-\begin{solution}
-<<>>=
-favstats(~ volume, data=RailTrail)
-densityplot(~ volume, xlab="# of crossings", data=RailTrail)
-filter(RailTrail, volume > 700)
-@
-The distribution of daily crossings is approximately normally
-distributed with mean 375 crossings and standard deviation of 127 crossings.
-There is one outlier with 736 crossings, which occurred on a Monday holiday in the spring
-(Memorial Day).
-\end{solution}
-\begin{problem}
-The \dataframe{RailTrail} dataset also contains an indicator of whether the day was
-a weekday (\variable{weekday==1}) or a weekend/holiday (\variable{weekday==0}).
-Use \function{tally()} to describe the distribution of this categorical variable.
-What percentage of the days are weekends/holidays?
-\end{problem}
-\begin{solution}
-<<>>=
-tally(~ weekday, data=RailTrail)
-tally(~ weekday, format="percent", data=RailTrail)
-@
-Just over 30\% of the days are weekends or holidays.
-\end{solution}
-\begin{problem}
-Use side-by-side boxplots to compare the distribution of \variable{volume} by day type in the \dataframe{RailTrail} dataset.
-Hint: you'll need to turn the numeric \variable{weekday} variable into a factor variable using \function{as.factor()}.
-What do you conclude?
-\end{problem}
-\begin{solution}
-<<>>=
-bwplot(volume ~ as.factor(weekday), data=RailTrail)
-@
-or
-<<>>=
-RailTrail = mutate(RailTrail, daytype = ifelse(weekday==1, "weekday", "weekend/holiday"))
-bwplot(volume ~ daytype, data=RailTrail)
-@
-We see that weekends/holidays tend to have more users.
-\end{solution}
-
-\begin{problem}
-Use overlapping densityplots to compare the distribution of \variable{volume} by day type in the
-\dataframe{RailTrail} dataset.
-What do you conclude?
-\end{problem}
-\begin{solution}
-<<>>=
-densityplot(~ volume, groups=weekday, auto.key=TRUE, data=RailTrail)
-@
-We see that weekends/holidays tend to have more users.
-\end{solution} -\begin{problem} -Create a scatterplot of \variable{volume} as a function of \variable{avgtemp} using the \dataframe{RailTrail} dataset, along with a regression line and scatterplot -smoother (lowess curve). What do you observe about the relationship? -\end{problem} -\begin{solution} -<<>>= -xyplot(volume ~ avgtemp, xlab="average temperature (degrees F)", - type=c("p", "r", "smooth"), lwd=2, data=RailTrail) -@ -We see that there is a positive relationship between these two variables, but the association is -somewhat nonlinear (which makes sense as we wouldn't continue to predict an increase in usage when the -temperature becomes uncomfortably warm). -\end{solution} -\begin{problem} -Using the \dataframe{RailTrail} dataset, -fit a multiple regression model for \variable{volume} as a function of \variable{cloudcover}, \variable{avgtemp}, -\variable{weekday} and the interaction -between day type and average temperature. -Is there evidence to retain the interaction term at the $\alpha=0.05$ level? -\end{problem} -\begin{solution} -<<>>= -fm = lm(volume ~ cloudcover + avgtemp + weekday + avgtemp:weekday, data=RailTrail) -summary(fm) -@ -The interaction between average temperature and day-type is statistically significant (p=0.016). We -interpret this as being a steeper slope (stronger association) on weekdays rather than weekends. -(Perhaps on weekends/holidays people will tend to head out on the trails irrespective of the weather?) -\end{solution} -\begin{problem} -Use \function{makeFun()} to calculate the predicted number of crossings on a weekday with average -temperature 60 degrees and no clouds. Verify this calculation using the coefficients from the -model. -<<>>= -coef(fm) -@ -\end{problem} -\begin{solution} -<<>>= -myfun = makeFun(fm) -myfun(cloudcover=0, avgtemp=60, weekday=1) -@ -We expect just over 480 crossings on a day with these characteristics. 
-\end{solution} -\begin{problem} -Use \function{makeFun()} and \function{plotFun()} to display predicted values for the number of crossings -on weekdays and weekends/holidays for average temperatures between 30 and 80 degrees and a cloudy day -(\variable{cloudcover=10}). -\end{problem} -\begin{solution} -<<>>= -myfun = makeFun(fm) -xyplot(volume ~ avgtemp, data=RailTrail) -plotFun(myfun(cloudcover=10, avgtemp, weekday=0) ~ avgtemp, lwd=2, add=TRUE) -plotFun(myfun(cloudcover=10, avgtemp, weekday=1) ~ avgtemp, lty=2, lwd=3, add=TRUE) -@ -We -interpret this as being a steeper slope (stronger association) on weekdays rather than weekends. -(Perhaps on weekends/holidays people will tend to head out on the trails irrespective of the weather?) -\end{solution} -\begin{problem} -Using the multiple regression model, generate a histogram (with overlaid normal -density) to assess the normality of the residuals. -\end{problem} -\begin{solution} -<<>>= -histogram(~ resid(fm), fit="normal") -@ -The distribution is approximately normal. -\end{solution} -\begin{problem} -Using the same model generate a scatterplot of the residuals versus predicted values and comment -on the linearity of the model and assumption of equal variance. -\end{problem} -\begin{solution} -<<>>= -xyplot(resid(fm) ~ fitted(fm), type=c("p", "r", "smooth")) -@ -The association is fairly linear, except in the tails. There's some evidence that the variability -of the residuals increases with larger fitted values. -\end{solution} -\begin{problem} -Using the same model generate a scatterplot of the residuals versus average temperature and comment -on the linearity of the model and assumption of equal variance. -\end{problem} -\begin{solution} -<<>>= -xyplot(resid(fm) ~ avgtemp, type=c("p", "r", "smooth"), data=RailTrail) -@ -The association is somewhat non-linear. There's some evidence that the variability -of the residuals increases with larger fitted values. 
-\end{solution} - -\chapter{Probability Distributions and Random Variables} - -\label{sec:DiscreteDistributions} -\label{sec:probability} -\myindex{random variables}% - -\R\ can calculate quantities related to probability distributions of all types. -It is straightforward to generate -random samples from these distributions, which can be used -for simulation and exploration. -<>= -xpnorm(1.96, mean=0, sd=1) # P(Z < 1.96) -@ -\Rindex{qnorm()}% -\Rindex{dnorm()}% -\Rindex{pnorm()}% -\Rindex{xpnorm()}% -\Rindex{rnorm()}% -\Rindex{integrate()}% -<>= -# value which satisfies P(Z < z) = 0.975 -qnorm(.975, mean=0, sd=1) -integrate(dnorm, -Inf, 0) # P(Z < 0) -@ -The following table displays the basenames for probability distributions -available within base \R. These functions can be prefixed by {\tt d} to -find the density function for the distribution, {\tt p} to find the -cumulative distribution function, {\tt q} to find quantiles, and {\tt r} to -generate random draws. For example, to find the density function of an exponential -random variable, use the command \function{dexp()}. -The \function{qDIST()} function is the inverse of the -\function{pDIST()} function, for a given basename {\tt DIST}. 
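As a quick illustration of how the four prefixes fit together (a sketch we add here, not one of the original chunks), the exponential distribution with rate 2 can be used to check that the `q` function inverts the `p` function, that the `p` function is the integral of the `d` (density) function, and that `r` draws have the expected mean:

```r
# q- inverts p-: applying qexp() to a CDF value recovers x
x <- 1.5
qexp(pexp(x, rate = 2), rate = 2)        # recovers 1.5

# p- integrates d-: P(X <= x) is the area under the density from 0 to x
integrate(dexp, 0, x, rate = 2)$value    # matches the next line
pexp(x, rate = 2)

# r- generates random draws; their mean approaches 1/rate = 0.5
set.seed(123)
mean(rexp(10000, rate = 2))
```

The same pattern holds for every basename in the table that follows.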
-\begin{center}
-\begin{tabular}{|c|c|} \hline
-Distribution & Basename \\ \hline
-beta & {\tt beta} \\
-binomial & {\tt binom} \\
-Cauchy & {\tt cauchy} \\
-chi-square & {\tt chisq} \\
-exponential & {\tt exp} \\
-F & {\tt f} \\
-gamma & {\tt gamma} \\
-geometric & {\tt geom} \\
-hypergeometric & {\tt hyper} \\
-logistic & {\tt logis} \\
-lognormal & {\tt lnorm} \\
-negative binomial & {\tt nbinom} \\
-normal & {\tt norm} \\
-Poisson & {\tt pois} \\
-Student's t & {\tt t} \\
-uniform & {\tt unif} \\
-Weibull & {\tt weibull} \\ \hline
-\end{tabular}
-\end{center}
-\DiggingDeeper{The \function{fitdistr()} function within the \pkg{MASS} package facilitates estimation
-of parameters for many distributions.}
-The \function{plotDist()} function can be used to display distributions in a variety of ways.
-<<>>=
-plotDist('norm', mean=100, sd=10, kind='cdf')
-@
-<<>>=
-plotDist('exp', kind='histogram', xlab="x")
-@
-<<>>=
-plotDist('binom', size=25, prob=0.25, xlim=c(-1,26))
-@
-\begin{problem}
-Generate a sample of 1000 exponential random variables with rate parameter
-equal to 2, and calculate the mean of those variables.
-\end{problem}
-\begin{solution}
-<<>>=
-x <- rexp(1000, rate=2)
-mean(x)
-@
-\end{solution}
-
-\begin{problem}
-Find the median of the random variable X, if it is exponentially distributed
-with rate parameter 10.
-\end{problem}
-\begin{solution}
-<<>>=
-qexp(.5, rate=10)
-@
-\end{solution}
-
-
-\chapter{Power Calculations}
-\label{chap:onesamppower}
-
-While not generally a major topic in introductory courses, power and sample size calculations
-help to reinforce key ideas in statistics. In this section, we will explore how \R\ can
-be used to undertake power calculations using analytic approaches.
-We consider a simple problem with two tests (t-test and
-sign test) of
-a one-sided comparison.
-
-We will compare the power of the sign test and the power of the test based on normal theory (one-sample, one-sided t-test) assuming that $\sigma$
-is known.
-Let $X_1, \ldots, X_{25}$ be i.i.d. $N(0.3, 1)$ (this is the alternative under which we wish to calculate power). Consider testing the null hypothesis $H_0: \mu=0$ versus $H_A: \mu>0$ at significance level $\alpha=0.05$.
-
-\section{Sign test}
-
-We start by calculating the Type I error rate for the sign test. Here we want to
-reject when the number of positive values is large. Under the null hypothesis, this is
-distributed as a binomial random variable with $n=25$ trials and probability $p=0.5$ of being
-a positive value. Let's consider values between 15 and 19.
-
-<<>>=
-xvals <- 15:19
-probs <- 1 - pbinom(xvals, size=25, prob=0.5)
-cbind(xvals, probs)
-qbinom(.95, size=25, prob=0.5)
-@
-So we see that if we decide to reject when the number of positive values is
-17 or larger, we will have an $\alpha$ level of \Sexpr{round(1-pbinom(16, 25, 0.5), 3)},
-which is near the nominal value in the problem.
-
-We calculate the power of the sign test as follows. The probability that $X_i > 0$, given that $H_A$ is true, is:
-<<>>=
-1 - pnorm(0, mean=0.3, sd=1)
-@
-We can view this graphically using the command:
-\begin{center}
-<<>>=
-xpnorm(0, mean=0.3, sd=1, lower.tail=FALSE)
-@
-\end{center}
-The power under the alternative is equal to the probability of getting 17 or more positive values,
-given that $p=0.6179$:
-
-\Rindex{pbinom()}%
-<<>>=
-1 - pbinom(16, size=25, prob=0.6179)
-@
-The power is modest at best.
-
-\section{T-test}
-
-We next calculate the power of the test based on normal theory. To keep the comparison
-fair, we will set our $\alpha$ level equal to 0.05388.
-
-<<>>=
-alpha <- 1-pbinom(16, size=25, prob=0.5); alpha
-@
-
-First we find the rejection region.
-<<>>=
-n <- 25; sigma <- 1   # given
-stderr <- sigma/sqrt(n)
-zstar <- qnorm(1-alpha, mean=0, sd=1)
-zstar
-crit <- zstar*stderr
-crit
-@
-
-
-\noindent
-Therefore, we reject for observed means greater than \Sexpr{round(crit,3)}.
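This rejection rule can also be checked by simulation (a sketch we add here; it is not part of the original notes, and the helper `reject_rate()` is a name we introduce for illustration). Under the null hypothesis the empirical rejection rate should be close to the chosen $\alpha$, and under the alternative it estimates the power:

```r
set.seed(1)
n <- 25; sigma <- 1
alpha <- 1 - pbinom(16, size = 25, prob = 0.5)   # ~0.054, matching the sign test
crit  <- qnorm(1 - alpha) * sigma / sqrt(n)      # cutoff for the sample mean

# rejection rate over many simulated samples with true mean mu
reject_rate <- function(mu, reps = 10000) {
  xbars <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
  mean(xbars > crit)
}
reject_rate(0)     # Type I error rate, near alpha
reject_rate(0.3)   # empirical power under the alternative
```

The simulated rates should agree closely with the analytic values derived in this section.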
-
-To calculate the power of this one-sided test, we find the probability
-under the alternative hypothesis
-to the right of this cutoff.
-
-<<>>=
-power <- 1 - pnorm(crit, mean=0.3, sd=stderr)
-power
-@
-
-The power of the test based on normal theory is \Sexpr{round(power,3)}.
-To provide a check (or for future calculations of this sort) we can use the
-\function{power.t.test()} function.
-<<>>=
-power.t.test(n=25, delta=.3, sd=1, sig.level=alpha, alternative="one.sided",
-type="one.sample")$power
-@
-
-This analytic (formula-based) approach yields a similar estimate to the value that we calculated directly.
-
-Overall, we see that the t-test has higher power than the sign test, if the underlying
-data are truly normal. \TeachingTip{It's useful to have students calculate power empirically,
-to demonstrate the power of simulations.}
-\begin{problem}
-\label{prob:power1}%
-Find the power of a two-sided two-sample t-test where both distributions
-are approximately normally distributed with the same standard deviation, but the groups differ by 50\% of the standard deviation. Assume that there are
-\Sexpr{n}
-observations per group and an $\alpha$ level of \Sexpr{alpha}.
-\end{problem}
-\begin{solution}
-<<>>=
-n <- 100
-alpha <- 0.01
-@
-<<>>=
-n
-alpha
-power.t.test(n=n, delta=.5, sd=1, sig.level=alpha)
-@
-\end{solution}
-\begin{problem}
-Find the sample size needed to have 90\% power for a two-group t-test
-where the true
-difference between means is 25\% of the standard deviation in the groups
-(with $\alpha=0.05$).
-\end{problem}
-\begin{solution}
-<<>>=
-power.t.test(delta=.25, sd=1, sig.level=0.05, power=0.90)
-@
-\end{solution}
-
-
-\chapter{Data Management}
-\label{sec:manipulatingData}%
-\myindex{data management}%
-\myindex{thinking with data}%
-
-\TeachingTip{The \emph{Start Teaching with R} book features an extensive section on data management, including use of the \function{read.file()} function to load data into \R\ and \RStudio.}
-Data management is a key capacity that allows students (and instructors) to ``compute with data'' or,
-as Diane Lambert of Google has stated, ``think with data''.
-We tend to keep student data management to a minimum during the early part of an introductory
-statistics course, then gradually introduce topics as needed. For courses where students
-undertake substantive projects, data management is more important. This chapter describes
-some key data management tasks.
-\myindex{read.file()}%
-
-\TeachingTip{The \pkg{dplyr} and \pkg{tidyr} packages provide an elegant approach to data management and facilitate the ability of students to compute with data. Hadley Wickham, author of the packages,
-suggests that there are six key idioms (or verbs) implemented within these packages that allow a large set of tasks to be accomplished:
-filter (keep rows matching criteria),
-select (pick columns by name),
-arrange (reorder rows),
-mutate (add new variables),
-summarise (reduce variables to values), and
-group by (collapse groups).}
-\section{Adding new variables to a dataframe}
-\myindex{dataframe}%
-We can add additional variables to an existing dataframe (the name for a dataset in \R) using \function{mutate()}. But first we create a smaller version of the \dataframe{iris} dataframe.
-
-\myindex{iris dataset}%
-<<>>=
-irisSmall <- select(iris, Species, Sepal.Length)
-@
-
-\myindex{adding variables}%
-\Rindex{mutate()}%
-\Rindex{cut()}%
-<<>>=
-# cut places data into bins
-irisSmall <- mutate(irisSmall,
-  Length = cut(Sepal.Length, breaks=4:8))
-@
-
-\TeachingTip{The \function{cut()} function has an option called \option{labels} which can be used to specify more descriptive names for the groups.}
-<<"mr-adding-variable2-again">>=
-head(irisSmall)
-@
-\Rindex{head()}%
-\myindex{display first few rows}%
-
-\myindex{CPS85 dataset}%
-The \dataframe{CPS85} dataframe contains data from a Current Population Survey (current in 1985, that is).
-Two of the variables in this dataframe are \variable{age} and \variable{educ}. We can estimate
-the number of years a worker has been in the workforce if we assume they have been in the workforce
-since completing their education and that their age at graduation is 6 more than the number
-of years of education obtained. We can add this as a new variable in the dataframe
-using \function{mutate()}.
-\myindex{CPS85 dataset}%
-\Rindex{mutate()}%
-<<>>=
-CPS85 <- mutate(CPS85, workforce.years = age - 6 - educ)
-favstats(~ workforce.years, data=CPS85)
-@
-In fact, this is what was done for all but one of the cases to create the \variable{exper}
-variable that is already in the \dataframe{CPS85} data.
-<<>>=
-tally(~ (exper - workforce.years), data=CPS85)
-@
-
-\section{Dropping variables}
-\myindex{dropping variables}%
-\Rindex{select()}%
-\Rindex{matches()}%
-Since we already have the \variable{exper} variable, there is no reason to keep our new variable. Let's drop it.
-Notice the clever use of the minus sign.
-
-<<>>=
-names(CPS85)
-CPS1 <- select(CPS85, -matches("workforce.years"))
-names(CPS1)
-@
-
-Any number of variables can be dropped or kept in a similar manner.
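Keeping a named subset of columns works the same way. A brief sketch (our addition, using a toy dataframe rather than \dataframe{CPS85}, and column names chosen only for illustration):

```r
library(dplyr)

# toy dataframe standing in for a survey dataset
df <- data.frame(age = c(35, 42), educ = c(12, 16), wage = c(7.5, 12.0), x = 1:2)

keep  <- select(df, age, educ, wage)      # keep columns by name
drop2 <- select(df, -matches("educ|x"))   # drop columns matching a pattern
names(keep)
names(drop2)
```

The positive form lists the columns to retain; the minus sign with `matches()` drops every column whose name matches the regular expression.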
-<<>>=
-CPS1 <- select(CPS85, -matches("workforce.years|exper"))
-@
-
-
-\section{Renaming variables}
-\myindex{renaming variables}%
-\Rindex{rename()}%
-\Rindex{row.names()}%
-The column (variable) names for a dataframe can be changed using the \function{rename()} function in the
-\pkg{dplyr} package.
-<<>>=
-names(CPS85)
-CPSnew = rename(CPS85, workforce=workforce.years)
-names(CPSnew)
-@
-
-The row names of a dataframe can be changed by
-simple assignment using \function{row.names()}.
-
-\Rindex{names()}%
-\myindex{faithful dataset}%
-The \dataframe{faithful} data set (in the \pkg{datasets} package, which is always available)
-has very unfortunate names.
-\TeachingTip{It's a good idea to start teaching good practices for choice of variable names from day one.}
-<<>>=
-names(faithful)
-@
-
-The measurements are the duration of an eruption and the time until the subsequent eruption,
-so let's give the variables better names.
-<<>>=
-faithful = rename(faithful,
-  duration = eruptions,
-  time.til.next=waiting)
-names(faithful)
-@
-\myindex{faithful dataset}%
-\begin{center}
-<<"mr-faithful-xy">>=
-xyplot(time.til.next ~ duration, alpha=0.5, data=faithful)
-@
-\end{center}
-If the variable containing a dataframe is modified or used to store a different object,
-the original data from the package can be recovered using \function{data()}.
-\Rindex{data()}%
-<<>>=
-data(faithful)
-head(faithful, 3)
-@
-
-\begin{problem}
-Using the \dataframe{faithful} dataframe, make a scatterplot of eruption duration times vs.\,the time
-since the previous eruption.
-\end{problem}
-
-
-\section{Creating subsets of observations}
-\myindex{creating subsets}%
-\myindex{subsets of dataframes}%
-\label{sec:subsets}
-We can also use \function{filter()} to reduce the size of a dataframe by selecting
-only certain rows.
-\begin{center}
-<<"mr-faithful-long-xy">>=
-data(faithful)
-names(faithful) <- c('duration', 'time.til.next')
-# any logical can be used to create subsets
-faithfulLong <- filter(faithful, duration > 3)
-xyplot( time.til.next ~ duration, data=faithfulLong )
-@
-\end{center}
-
-
-\section{Sorting dataframes}
-\myindex{sorting dataframes}%
-\Rindex{arrange()}%
-
-Dataframes can be sorted using the \function{arrange()} function.
-<<>>=
-head(faithful, 3)
-sorted <- arrange(faithful, duration)
-head(sorted, 3)
-@
-\Caution{It is usually better to make new datasets rather than modifying the original.}
-
-
-
-
-\section{Merging datasets}
-\myindex{merging dataframes}%
-
-The \dataframe{fusion1} dataframe in the \pkg{fastR} package contains
-genotype information for a SNP (single nucleotide polymorphism) in the gene
-\emph{TCF7L2}.
-The \dataframe{pheno} dataframe contains phenotypes
-(including type 2 diabetes case/control status) for an intersecting set of individuals.
-We can join (or merge) these together to explore the association between
-genotypes and phenotypes using \function{inner\_join()} from the \pkg{dplyr} package.
-
-\Rindex{arrange()}%
-<<>>=
-require(fastR)
-require(dplyr)
-fusion1 <- arrange(fusion1, id)
-head(fusion1, 3)
-head(pheno, 3)
-@
-
-\Rindex{inner\_join()}%
-<<>>=
-require(tidyr)
-fusion1m <- inner_join(fusion1, pheno, by='id')
-head(fusion1m, 3)
-@
-\Rindex{tidyr package}%
-
-\myindex{fusion1 dataset}%
-Now we are ready to begin our analysis.
-<<"mr-fusion1-xtabs">>=
-tally(~t2d + genotype, data=fusion1m)
-@
-
-\begin{problem}
-The \dataframe{fusion2} data set in the \pkg{fastR} package contains genotypes for
-another SNP. Merge \dataframe{fusion1}, \dataframe{fusion2}, and \dataframe{pheno} into a single data
-frame.
-
-Note that \dataframe{fusion1} and \dataframe{fusion2} have the same columns.
-<<>>= -names(fusion1) -names(fusion2) -@ -You may want to use the \option{suffixes} argument to \function{merge()} or rename the variables -after you are done merging to make the resulting dataframe easier to navigate. - -Tidy up your dataframe by dropping any columns that are redundant or that you just don't want to -have in your final dataframe. -\end{problem} - -\section{Slicing and dicing} -\myindex{reshaping dataframes}% -\myindex{transforming dataframes}% -\myindex{transposing dataframes}% -The \pkg{tidyr} package provides a flexible way to change the arrangement of data. -It was designed for converting between long and wide versions of -time series data and its arguments are named with that in mind. -\TeachingTip{The vignettes that accompany the \pkg{tidyr} and \pkg{dplyr} packages feature a number of useful examples of common data manipulations.} - - -A common situation is when we want to convert from a wide form to a -long form because of a change in perspective about what a unit of -observation is. For example, in the \dataframe{traffic} dataframe, each -row is a year, and data for multiple states are provided. - -<<"mr-traffic-reshape">>= -traffic -@ -We can reformat this so that each row contains a measurement for a -single state in a particular year. - -\Rindex{gather()}% -<>= -longTraffic <- traffic %>% - gather(state, deathRate, ny:ri) -head(longTraffic) -@ - -We can also reformat the other way, this time having all data for a given state -form a row in the dataframe. -<>= -stateTraffic <- longTraffic %>% - select(year, deathRate, state) %>% - mutate(year=paste("deathRate.", year, sep="")) %>% - spread(year, deathRate) -stateTraffic -@ -\Rindex{spread()}% -\Rindex{select()}% -\Rindex{mutate()}% -\Rindex{paste()}% - -\section{Derived variable creation} -\myindex{derived variables} - -A number of functions help facilitate the creation or recoding of variables. 
-
-\subsection{Creating a categorical variable from a quantitative variable}
-
-Next we demonstrate how to
-create a three-level categorical variable
-with cuts at 20 and 40 for the CESD scale (which ranges from 0 to 60 points).
-
-\Rindex{cut()}%
-\Rindex{mutate()}%
-\Rindex{include.lowest option}%
-\Rindex{breaks option}%
-<<>>=
-favstats(~ cesd, data=HELPrct)
-HELPrct = mutate(HELPrct, cesdcut = cut(cesd,
-  breaks=c(0, 20, 40, 60), include.lowest=TRUE))
-bwplot(cesd ~ cesdcut, data=HELPrct)
-@
-\Rindex{ntiles()}%
-\TeachingTip{The \function{ntiles()} function can be used to automate creation of groups in this manner.}
-
-It might be preferable to give the groups better labels.
-<<>>=
-HELPrct = mutate(HELPrct, cesdcut = cut(cesd,
-  labels=c("low", "medium", "high"),
-  breaks=c(0, 20, 40, 60), include.lowest=TRUE))
-bwplot(cesd ~ cesdcut, data=HELPrct)
-@
-
-
-\subsection{Reordering factors}
-\myindex{reordering factors}%
-\myindex{factor reordering}%
-\Rindex{relevel()}%
-\Rindex{mutate()}%
-\Rindex{coef()}%
-\Rindex{tally()}%
-By default, \R\ uses the first level in lexicographic order as the reference group for modeling. This
-can be overridden using the \function{relevel()} function (see also \function{reorder()}).
-<<>>=
-tally(~ substance, data=HELPrct)
-coef(lm(cesd ~ substance, data=HELPrct))
-HELPrct = mutate(HELPrct, subnew = relevel(substance,
-  ref="heroin"))
-coef(lm(cesd ~ subnew, data=HELPrct))
-@
-
-\section{Group-wise statistics}
-\label{sec:groupby}
-
-\myindex{group-wise statistics}%
-\Rindex{select()}%
-
-It can often be useful to calculate summary statistics by group, and add
-these into a dataset. The \function{group\_by()} function in the \pkg{dplyr} package
-facilitates this process. Here we demonstrate how to add a variable containing
-the mean age of subjects within each substance group.
-
-\Rindex{favstats()}%
-\Rindex{group\_by()}%
-\Rindex{left\_join()}%
-\Rindex{summarise()}%
-<<>>=
-favstats(age ~ substance, data=HELPrct)
-ageGroup <- HELPrct %>%
-  group_by(substance) %>%
-  summarise(agebygroup = mean(age))
-ageGroup
-HELPmerged <- left_join(ageGroup, HELPrct, by="substance")
-favstats(agebygroup ~ substance, data=HELPmerged)
-@
-
-
-\section{Accounting for missing data}
-\label{sec:miss}
-
-\myindex{missing data}%
-\myindex{incomplete data}%
-\Rindex{select()}%
-\Rindex{dim()}%
-\Rindex{NA character}%
-Missing values arise in almost all real-world investigations. \R\ uses \variable{NA} to indicate
-missing data. The \dataframe{HELPmiss} dataframe within the \pkg{mosaicData} package includes all
-$n=470$ subjects enrolled at baseline (including the $n=17$ subjects with some missing data who
-were not included in \dataframe{HELPrct}).
-\myindex{HELPmiss dataset}%
-<<>>=
-smaller = select(HELPmiss, cesd, drugrisk, indtot, mcs, pcs,
-  substance)
-dim(smaller)
-summary(smaller)
-@
-
-Of the 470 subjects in the six-variable dataframe, only the \code{drugrisk}, \code{indtot}, \code{mcs}, and \code{pcs} variables have missing values.
-
-\Rindex{with()}%
-\Rindex{na.omit()}%
-\Rindex{favstats()}%
-\Rindex{is.na()}%
-<<>>=
-favstats(~ mcs, data=smaller)
-with(smaller, sum(is.na(mcs)))
-nomiss <- na.omit(smaller)
-dim(nomiss)
-favstats(~ mcs, data=nomiss)
-@
-
-Alternatively, we could generate the same dataset using logical conditions.
-<<>>=
-nomiss <- filter(smaller,
-  (!is.na(mcs) & !is.na(indtot) & !is.na(drugrisk)))
-dim(nomiss)
-@
-
-\chapter{Health Evaluation and Linkage to Primary Care (HELP) Study}
-
-\label{sec:help}
-
-\myindex{HELP study}%
-\myindex{Health Evaluation and Linkage to Primary Care study}%
-Many of the examples in this guide utilize data from the HELP study,
-a randomized clinical trial for adult inpatients recruited from a detoxification unit.
-Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care,
-with the goal of linking them to primary medical care.
-Funding for the HELP study was provided by the National Institute
-on Alcohol Abuse and Alcoholism (R01-AA10870, Samet PI) and
-National Institute on Drug Abuse (R01-DA10019, Samet PI).
-The details of the
-randomized trial along with the results from a series of additional analyses have been published\cite{same:lars:hort:2003,lieb:save:2002,kert:hort:frie:2003}.
-
-Eligible subjects were
-adults who spoke Spanish or English, reported alcohol, heroin, or
-cocaine as their first or second drug of choice, and either resided in proximity
-to the primary care clinic to which they would be referred or were
-homeless. Patients with established primary care relationships
-they planned to continue, significant dementia, specific plans to
-leave the Boston area that would prevent research participation,
-failure to provide contact information for tracking purposes, or
-pregnancy were excluded.
-
-Subjects were interviewed at baseline during
-their detoxification stay, and follow-up interviews were undertaken
-every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.
-The Institutional Review Board of
-Boston University Medical Center approved all aspects of the study, including the creation of the de-identified dataset. Additional
-privacy protection was secured by the issuance of a Certificate of
-Confidentiality by the Department of Health and Human Services.
-
-The \pkg{mosaicData} package contains several forms of the de-identified HELP dataset.
-We will focus on \dataframe{HELPrct}, which contains
-27 variables for the 453 subjects
-with minimal missing data, primarily at baseline.
-Variables included in the HELP dataset are described in Table \ref{tab:helpvars}.
More information can be found here\cite{horton:kleinman:2015}. -A copy of the study instruments can be found at: \url{http://www.amherst.edu/~nhorton/help}. -\begin{longtable}{|p{2.1cm}|p{6.8cm}|p{3.5cm}|} -\caption{Annotated description of variables in the \dataframe{HELPrct} dataset} -\label{tab:helpvars} \\ -\hline -VARIABLE & DESCRIPTION (VALUES) & NOTE \\ \hline -\variable{age} & age at baseline (in years) (range 19--60) & \\ \hline -\variable{anysub} & use of any substance post-detox & see also \variable{daysanysub} -\\ \hline -\variable{cesd} & Center for Epidemiologic Studies Depression scale (range 0--60, higher scores indicate more depressive symptoms) & \\ \hline -\variable{d1} & how many times hospitalized for medical problems (lifetime) (range 0--100) & \\ \hline -\variable{daysanysub} & time (in days) to first use of any substance post-detox (range 0--268) & see also \variable{anysubstatus} \\ \hline -\variable{dayslink} & time (in days) to linkage to primary care (range 0--456) & see also \variable{linkstatus} -\\ \hline -\variable{drugrisk} & Risk-Assessment Battery (RAB) drug risk score (range 0--21) & see also \variable{sexrisk} -\\ \hline -\variable{e2b} & number of times in past 6 months entered a detox program (range 1--21) & \\ \hline -\variable{female} & gender of respondent (0=male, 1=female) & -\\ \hline -\variable{g1b} & experienced serious thoughts of suicide (last 30 days, values 0=no, 1=yes) & -\\ \hline -\variable{homeless} & 1 or more nights on the street or shelter in past 6 months (0=no, 1=yes) & -\\ \hline -\variable{i1} & average number of drinks (standard units) consumed per day (in the past 30 days, range 0--142) & see also \variable{i2} -\\ \hline -\variable{i2} & maximum number of drinks (standard units) consumed per day (in the past 30 days range 0--184) & see also \variable{i1} -\\ \hline -\variable{id} & random subject identifier (range 1--470) & -\\ \hline -\variable{indtot} & Inventory of Drug Use Consequences (InDUC) 
total score (range 4--45) & -\\ \hline -\variable{linkstatus} & post-detox linkage to primary care (0=no, 1=yes) & see also \variable{dayslink} -\\ \hline -\variable{mcs} & SF-36 Mental Component Score (range 7-62, higher scores are better) & see also \variable{pcs} -\\ \hline -\variable{pcs} & SF-36 Physical Component Score (range 14-75, higher scores are better) & see also \variable{mcs} -\\ \hline -\variable{pss\_fr} & perceived social supports (friends, range 0--14) & -\\ \hline -\variable{racegrp} & race/ethnicity (black, white, hispanic or other) & \\ \hline -\variable{satreat} & any BSAS substance abuse treatment at baseline (0=no, 1=yes) & \\ \hline -\variable{sex} & sex of respondent (male or female) & \\ \hline -\variable{sexrisk} & Risk-Assessment Battery (RAB) sex risk score (range 0--21) & see also \variable{drugrisk} -\\ \hline -\variable{substance} & primary substance of abuse (alcohol, cocaine or heroin) & -\\ \hline -\variable{treat} & randomization group (randomize to HELP clinic, no or yes) & -\\ \hline -\end{longtable} -\noindent -Notes: Observed range is provided (at baseline) for continuous variables. - - -\chapter{Exercises and Problems} - -\shipoutProblems - -\bibliographystyle{alpha} -\bibliography{../include/USCOTS} diff --git a/Traditional/Master-Core.Rnw b/Traditional/Master-Core.Rnw deleted file mode 100644 index e1bdc3f..0000000 --- a/Traditional/Master-Core.Rnw +++ /dev/null @@ -1,68 +0,0 @@ - - -\documentclass[open-any,12pt]{tufte-book} - -\usepackage{../include/RBook} -\usepackage{pdfpages} -%\usepackage[shownotes]{authNote} -\usepackage[hidenotes]{authNote} - -\def\tilde{\texttt{\~}} - -\title{A Compendium of R Commands to Teach Statistics} -\author{Nicholas J. Horton,\\Daniel T. 
Kaplan and\\Randall Pruim} -\date{DRAFT: \today} - -\renewenvironment{knitrout}{\relax}{\noindent} - -<>= -require(grDevices) -require(datasets) -require(stats) -require(lattice) -require(grid) -require(mosaic) -require(mosaicData) -trellis.par.set(theme=col.mosaic(bw=FALSE)) -trellis.par.set(fontsize=list(text=9)) -options(keep.blank.line=FALSE) -options(width=60) -require(vcd) -require(knitr) -opts_chunk$set( tidy=TRUE, - size='small', - dev="pdf", - fig.path="figures/fig-", - fig.width=3, fig.height=2, - fig.align="center", - fig.show="hold", - comment=NA) -@ -<>= -knit_hooks$set(document = function(x) { - sub('\\usepackage[]{color}', '\\usepackage[]{xcolor}', - x, fixed = TRUE) -}) -@ -\begin{document} - - -%\maketitle - -\includepdf{USCOTS-cover} - -\newpage - -\tableofcontents - -\newpage - -<>= -@ - -<>= -@ - -\printindex - -\end{document} diff --git a/Traditional/Master/Master-Core.Rnw b/Traditional/Master/Master-Core.Rnw deleted file mode 100644 index 92c795a..0000000 --- a/Traditional/Master/Master-Core.Rnw +++ /dev/null @@ -1,13 +0,0 @@ -% All pre-amble stuff should go into ../include/MainDocument.Rnw -\title{R for the Core of a Traditional Course} -\author{Randall Pruim and Nicholas Horton and Daniel Kaplan} -\date{DRAFT: \today} -\Sexpr{set_parent('../../include/MainDocument.Rnw')} % All the latex pre-amble for the book -\maketitle - -\tableofcontents - -\newpage - -\import{../}{Core} - diff --git a/Traditional/Outline-Traditional.Rmd b/Traditional/Outline-Traditional.Rmd deleted file mode 100644 index a353933..0000000 --- a/Traditional/Outline-Traditional.Rmd +++ /dev/null @@ -1 +0,0 @@ -## R Core for a Traditional Course: Outline diff --git a/Traditional/USCOTS-cover.pdf b/Traditional/USCOTS-cover.pdf deleted file mode 100644 index fce01b7..0000000 Binary files a/Traditional/USCOTS-cover.pdf and /dev/null differ diff --git a/Traditional/core-cover.pptx b/Traditional/core-cover.pptx deleted file mode 100644 index 7780bf9..0000000 Binary files 
a/Traditional/core-cover.pptx and /dev/null differ diff --git a/Traditional/makefile b/Traditional/makefile deleted file mode 100644 index e4e4884..0000000 --- a/Traditional/makefile +++ /dev/null @@ -1,12 +0,0 @@ -all: Master-Core.pdf - -Master-Core.pdf: Master-Core.tex - pdflatex Master-Core - bibtex Master-Core - makeindex Master-Core - pdflatex Master-Core - -Master-Core.tex: Master-Core.Rnw Core.Rnw - knitr Master-Core.Rnw - - diff --git a/_output.yaml b/_output.yaml new file mode 100644 index 0000000..a2c163b --- /dev/null +++ b/_output.yaml @@ -0,0 +1,2 @@ + html_document: + keep_md: yes diff --git a/include/RBook.sty b/include/RBook.sty index 018e3e9..5f7822c 100644 --- a/include/RBook.sty +++ b/include/RBook.sty @@ -99,8 +99,8 @@ \tikzNote[#1]{\centerline{Note}}{#2}{}% } -\newcommand{\BlankNote}[1][0pt]{% -\tikzNote[#1]{}{}{}% +\newcommand{\BlankNote}[2][0pt]{% +\tikzNote[#1]{\relax}{#2}{}% } diff --git a/include/USCOTS.bib b/include/USCOTS.bib index 8272215..aeb2730 100644 --- a/include/USCOTS.bib +++ b/include/USCOTS.bib @@ -22,7 +22,7 @@ @Book{salsburg } @book{Sleuth2, - author = {Fred Ramsey and Dan Schafer}, + author = {F. Ramsey and D. Schafer}, title = {Statistical Sleuth: A Course in Methods of Data Analysis}, edition={2nd}, year = {2002}, @@ -726,12 +726,12 @@ @qut.edu.au @TechReport{ASAcurriculum2014, - author={Undergraduate Guidelines Workshop}, + author={ASA Undergraduate Guidelines Workgroup}, title={2014 Curriculum Guidelines for Undergraduate Programs in Statistical Science}, year=2014, month=Nov, institution={American Statistical Association}, - url={http://www.amstat.org/education/pdfs/guidelines2014-11-15.pdf} + note={\url{http://www.amstat.org/education/curriculumguidelines.cfm}} } @TechReport{RePEc:nbr:nberwo:4521, @@ -1921,10 +1921,271 @@ @Article{galton-co-relations } @article{baum:2014, -Author = {Ben Baumer and Mine \c{C}etinkaya-Rundel and Andrew Bray and Linda Loi and Nicholas J. Horton}, +Author = {B.S. Baumer and M. 
\c{C}etinkaya-Rundel and A. Bray and L. Loi and N. J. Horton},
Journal = {Technology Innovations in Statistics Education},
Pages = {281-283},
Title = {{R Markdown}: Integrating A Reproducible Analysis Tool into Introductory Statistics},
Volume = {8},
Number = {1},
Year = {2014}}
+
+@article{hort:2015,
+Author = {N.J. Horton and B.S. Baumer and H. Wickham},
+Journal = {CHANCE},
+Pages = {40--50},
+Title = {Setting the stage for data science: integration of data management skills in introductory and second courses in statistics (\url{http://arxiv.org/abs/1401.3269})},
+Volume = {28},
+Number = {2},
+Year = {2015}}
+
+
+@inproceedings{Pruim:MinimalR:2011,
+  title = {Teaching Statistics with {R}},
+  author = {Randall Pruim},
+  year = 2011,
+  booktitle = {Joint Statistical Meetings Roundtable}
+  }
+
+@inproceedings{Wang:USCOTS:2015,
+  title = {Visualization as the Gateway Drug to Statistics in Week One},
+  author = {Xiaofei (Susan) Wang and Cynthia Rush},
+  year = 2015,
+  booktitle = {United States Conference on Teaching Statistics}
+  }
+
+@Manual{R,
+  title = {R: A Language and Environment for Statistical Computing},
+  author = {{R Core Team}},
+  organization = {R Foundation for Statistical Computing},
+  address = {Vienna, Austria},
+  year = {2012},
+  note = {{ISBN} 3-900051-07-0},
+  url = {http://www.R-project.org/},
+}
+
+@article{ihaka:1996,
+  Author = {Ihaka, Ross and Gentleman, Robert},
+  Journal = {Journal of Computational and Graphical Statistics},
+  Number = 3,
+  Pages = {299--314},
+  Title = {R: A Language for Data Analysis and Graphics},
+  Volume = 5,
+  Year = 1996}
+
+@article{Baumer:RMarkdown:2014,
+  author = {Baumer, Ben and {Cetinkaya-Rundel}, Mine and
+    Bray, Andrew and Loi, Linda and Horton, Nicholas J},
+  year = {2014},
+  title = {R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics},
+  journal = {Technology Innovations in Statistics Education},
+  volume = 8,
+  number = 1,
+  url = {http://escholarship.org/uc/item/90b2f5xh}
+
}
+
+@article{Wild:RSS:2011,
+  Author = {Wild, C. J. and Pfannkuch, M. and Regan, M. and Horton, N. J.},
+  Title = {Towards more accessible conceptions of statistical inference},
+  Journal = {Journal of the Royal Statistical Society: Series A (Statistics in Society)},
+  Year = 2011,
+  volume = {174},
+  number = {2},
+  pages = {247--295}
+  }
+
+@article{Hesterberg:2015,
+  Author = {Tim C. Hesterberg},
+  Title = {What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum},
+  Year = 2015,
+  Journal = {The American Statistician}
+}
+
+@article{NolanTempleLang:2010,
+  Author = {Deborah Nolan and Duncan Temple Lang},
+  Year = 2010,
+  Title = {Computing in the Statistics Curricula},
+  Journal = {The American Statistician},
+  Volume = 64,
+  Number = 2,
+  pages = {97--107},
+  url = {http://dx.doi.org/10.1198/tast.2010.09132}
+}
+
+@article{Ridgway:2015,
+  Title = {Implications of the Data Revolution for Statistics Education},
+  Author = {Ridgway, J},
+  Journal = {International Statistical Review},
+  Year = 2015
+}
+
+@article{HortonBaumerWickham:2015,
+  Author = {Nicholas J. Horton and Benjamin S. Baumer and Hadley Wickham},
+  Year = 2015,
+  Title = {Setting the stage for data science: integration of data management skills in introductory and second courses in statistics},
+  journal = {CHANCE},
+  Volume = 28,
+  Number = 2,
+  Pages = {40--50}
+}
+
+@ARTICLE{2015arXiv150200318H,
+  author = {{Horton}, N.~J. and {Baumer}, B.~S.
and {Wickham}, H.},
+  title = "{Setting the stage for data science: integration of data management skills in introductory and second courses in statistics}",
+  journal = {ArXiv e-prints},
+archivePrefix = "arXiv",
+  eprint = {1502.00318},
+  primaryClass = "stat.CO",
+  keywords = {Statistics - Computation, Computer Science - Computers and Society, Statistics - Other Statistics, 62-01},
+  year = 2015,
+  month = feb,
+  adsurl = {http://adsabs.harvard.edu/abs/2015arXiv150200318H},
+  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+}
+
+
+@ARTICLE{Tintle:TAS:2015,
+  author = {{Tintle}, N. and {Chance}, B. and {Cobb}, G. and {Roy}, S. and
+    {Swanson}, T. and {VanderStoep}, J.},
+  title = "{Combating anti-statistical thinking using simulation-based methods throughout the undergraduate curriculum}",
+  journal = {The American Statistician},
+  volume = 69,
+  number = 4,
+  year = 2015
+}
+
+@ARTICLE{Grolemund:ISR:2014,
+title = {A Cognitive Interpretation of Data Analysis},
+author = {Grolemund, Garrett and Wickham, Hadley},
+year = {2014},
+journal = {International Statistical Review},
+volume = {82},
+number = {2},
+pages = {184--204},
+abstract = {This paper proposes a scientific model to explain the analysis
+process. We argue that data analysis is primarily a procedure to build
+understanding, and as such, it dovetails with the cognitive processes of the
+human mind. Data analysis tasks closely resemble the cognitive process known as
+sensemaking. We demonstrate how data analysis is a sensemaking task adapted to
+use quantitative data. This identification highlights a universal structure
+within data analysis activities and provides a foundation for a theory of data
+analysis. The competing tensions of cognitive compatibility and scientific
+rigour create a series of problems that characterise the data analysis process.
+These problems form a useful organising model for the data analysis task while
+allowing methods to remain flexible and situation dependent. The insights of
+this model are especially helpful for consultants, applied statisticians and
+teachers of data analysis.},
+url = {http://EconPapers.repec.org/RePEc:bla:istatr:v:82:y:2014:i:2:p:184-204}
+}
+
+@book{Salsburg:2002,
+  author = {Salsburg, David},
+  title = {{The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century}},
+  publisher = {Holt Paperbacks},
+  isbn = {0805071342},
+  year = {2002}
+}
+
+
+ @Manual{mosaic,
+   title = {mosaic: Project MOSAIC Statistics and Mathematics Teaching Utilities},
+   author = {Randall Pruim and Daniel Kaplan and Nicholas Horton},
+   year = {2015},
+   note = {R package version 0.12.9003},
+ }
+
+
+ @Book{lattice,
+   title = {Lattice: Multivariate Data Visualization with R},
+   author = {Deepayan Sarkar},
+   publisher = {Springer},
+   address = {New York},
+   year = {2008},
+   note = {ISBN 978-0-387-75968-5},
+   url = {http://lmdvr.r-forge.r-project.org},
+ }
+
+
+
+ @Book{ggplot2,
+
author = {Hadley Wickham}, + title = {ggplot2: Elegant Graphics for Data Analysis}, + publisher = {Springer-Verlag New York}, + year = {2009}, + isbn = {978-0-387-98140-6}, + url = {http://had.co.nz/ggplot2/book}, + } + + + @Manual{resample, + title = {resample: Resampling Functions}, + author = {Tim Hesterberg}, + year = {2015}, + note = {R package version 0.4}, + url = {https://CRAN.R-project.org/package=resample} + } + + @Manual{dplyr, + title = {dplyr: A Grammar of Data Manipulation}, + author = {Hadley Wickham and Romain Francois}, + note = {R package version 0.4.3.9000}, + url = {https://github.com/hadley/dplyr}, + year = 2015 + } + + @Manual{mosaicData, + title = {mosaicData: Project MOSAIC (mosaic-web.org) data sets}, + author = {Randall Pruim and Daniel Kaplan and Nicholas Horton}, + year = {2015}, + note = {R package version 0.9.9001}, + } + +@book{Lock5:2012, + title={Statistics: Unlocking the Power of Data}, + author={Lock, Robin H and Lock, Patti Frazer and Morgan, Kari Lock}, + year={2012}, + publisher={Wiley Global Education} +} + +@book{Tintle:ISI:2015, + title = {Introduction to Statistical Investigations}, + author = {Nathan Tintle and Beth Chance and George Cobb and Allan Rossman and Soma Roy + and Todd Swanson and Jill VanderStoep}, + publisher={Wiley Global Education}, + year = 2015 + } + +@Manual{boot, + title = {boot: Bootstrap R (S-Plus) Functions}, + author = {Angelo Canty and Brian Ripley}, + year = {2015}, + note = {R package version 1.3-17}, + url = {https://CRAN.R-project.org/package=boot}, +} + +@book{boot-book, + author = {Davison, A. C. and Hinkley, D. 
V.}, + year = {1997}, + title = {Bootstrap Methods and Their Applications}, + publisher = {Cambridge University Press}, + location = {Cambridge}, + isbn = {0-521-57391-2} + } \ No newline at end of file diff --git a/include/authNote.sty b/include/authNote.sty new file mode 100644 index 0000000..e332349 --- /dev/null +++ b/include/authNote.sty @@ -0,0 +1,209 @@ + +\NeedsTeXFormat{LaTeX2e}[1995/12/01] +\ProvidesPackage{authNote}[2005/06/14 1.0 (RJP)] + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% my package requirements +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\RequirePackage{ifthen} + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% options and booleans for them +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%\reversemarginpar % ?? + +%\smartqed % makes \qed print at the rightmargin + +\newboolean{shownotes} +\setboolean{shownotes}{true} +\DeclareOption{hidenotes}{\setboolean{shownotes}{false}} +\DeclareOption{shownotes}{\setboolean{shownotes}{true}} +\DeclareOption{hide}{\setboolean{shownotes}{false}} +\DeclareOption{show}{\setboolean{shownotes}{true}} + +\newboolean{showhmm} +\setboolean{showhmm}{true} +\DeclareOption{hidehmm}{\setboolean{showhmm}{false}} +\DeclareOption{showhmm}{\setboolean{showhmm}{true}} + +\newboolean{showopt} +\setboolean{showopt}{true} +\DeclareOption{hideopt}{\setboolean{showopt}{false}} +\DeclareOption{showopt}{\setboolean{showopt}{true}} + +\newboolean{showold} +\setboolean{showold}{false} +\DeclareOption{showold}{\setboolean{showold}{true}} +\DeclareOption{hideold}{\setboolean{showold}{false}} + +\DeclareOption{primary}{% + \setboolean{showhmm}{true} + \setboolean{showopt}{true} + \setboolean{shownotes}{true} + \setboolean{showold}{false} + } + +\DeclareOption{secondary}{% + \setboolean{showhmm}{false} + \setboolean{showopt}{false} + \setboolean{shownotes}{true} + \setboolean{showold}{false} + } + 
+\DeclareOption{clean}{% + \setboolean{showhmm}{false} + \setboolean{showopt}{false} + \setboolean{shownotes}{false} + \setboolean{showold}{false} + } + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + + +\ProcessOptions* + + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%Translation Helps +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +%\def\transbox#1{{\textbf{#1}}} +\def\transbox#1{{\textbf{#1}}} + +\def\hmm[#1]#2{\ifthenelse% + {\boolean{showhmm}}% + {\transbox{#2}\smallmarginpar{#1}{}}% + {#2}% +} + +\def\hmmok[#1]#2{% + \ifthenelse{\boolean{showold}}% + {\transbox{#2}\smallmarginpar{#1}}% + {#2}% +} + +\def\hmmOK[#1]#2{% + \ifthenelse{\boolean{showold}}% + {\transbox{#2}\smallmarginpar{#1}}% + {#2}% +} + +\newcommand{\options}[2]{% + \ifthenelse{\boolean{showopt}}% +%{\textbf{$\mathbf<$#1 $\mathbf\mid$ #2$\mathbf >$}\smallmarginpar{$<\mid>$}}% + {% + \smallmarginpar{$<$#1$\mid$#2$>$}% + #1% + }% + {#1}% +} + + +\newcommand{\optionsok}[2]{% + \ifthenelse{\boolean{showold}}% +% {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {% + \smallmarginpar{$<$#1$\mid$#2$>$}% + #1% + }% + {#1}% + } + +\newcommand{\optionsOk}[2]{% + \ifthenelse{\boolean{showold}}% +% {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {% + \smallmarginpar{$<$#1$\mid$#2$>$}% + #1% + }% + {#1}% + } + +\newcommand{\optionsOne}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#1}% + } + +\newcommand{\optionsone}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#1$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#1}% + } + +\newcommand{\optionsTwo}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#2$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#2}% + } + +\newcommand{\optionstwo}[2]{% + \ifthenelse{\boolean{showold}}% + {\textbf{$\mathbf<$#2$\mathbf >$}\smallmarginpar{$(<\mid>)$}}% + {#2}% + } + + 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% authNote stuff +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\newtoks\tempTok +\newcounter{noteNum}[section] +\newwrite\noteFile% +\immediate\openout\noteFile=\jobname.notes + +\long\def\saveNote#1{% +\refstepcounter{noteNum}% +\immediate\write\noteFile{\string\begingroup\string\bf }% +\immediate\write\noteFile{\thesection .% +\expandafter\arabic{noteNum}}% +\immediate\write\noteFile{(p. \expandafter\thepage): }% +\immediate\write\noteFile{\string\endgroup}% +\tempTok={#1} +\immediate\write\noteFile{\the\tempTok}% +\immediate\write\noteFile{}% +} + + +\def\smallmarginpar#1{\marginpar[\hfill \tiny #1]{\raggedright \tiny #1 \hfill}} + +\long\def\saveNshowNote#1#2{% + \saveNote{#2}% + \ifthenelse{\boolean{shownotes}}{% + \marginpar[\hfill {\tiny #1 + \thesection.\arabic{noteNum} $\rightarrow$}]% + {{\tiny $\leftarrow$ \thesection.\arabic{noteNum} #1 \hfill}}% + }{\relax}% +} + +% to remove marginal notes (for submissions, etc) use below instead: + +\long\def\authNote#1{\saveNshowNote{}{#1}} +\long\def\oldNote#1{\saveNshowNote{old}{old: #1}} +\long\def\authNoted#1{% +\ifthenelse{\boolean{showold}}% +{\saveNshowNote{$\surd$}{(Done) #1}}% +{\relax}% +} + +\long\def\authNotedOld#1{\relax} + + +\def\authNotes{% +\ifthenelse{\boolean{shownotes}}{% +%\section*{Author Notes} +\begingroup +\immediate\closeout\noteFile +\parindent=0pt +\input \jobname.notes +\endgroup +} +{\relax} +} + + diff --git a/include/language.sty b/include/language.sty new file mode 100644 index 0000000..54006ce --- /dev/null +++ b/include/language.sty @@ -0,0 +1,44 @@ +\ProvidesPackage{language} + +\RequirePackage{xstring} +\RequirePackage{xcolor} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% Looking for a consistent typography for language elements. 
+\providecommand{\R}{}
+\renewcommand{\R}{\mbox{\sf{R}}}
+\providecommand{\RStudio}{}
+\renewcommand{\RStudio}{\mbox{\sf{R}Studio}}
+\providecommand{\Sage}{}
+\renewcommand{\Sage}{\mbox{\sf{Sage}}}
+
+\providecommand{\variable}[1]{}
+\renewcommand{\variable}[1]{{\color{green!50!black}\texttt{#1}}}
+\providecommand{\dataframe}[1]{}
+\renewcommand{\dataframe}[1]{{\color{blue!80!black}\texttt{#1}}}
+\providecommand{\function}[1]{}
+\renewcommand{\function}[1]{{\color{purple!75!blue}\texttt{\StrSubstitute{#1}{()}{}()}}}
+\providecommand{\option}[1]{}
+\renewcommand{\option}[1]{{\color{brown!80!black}\texttt{#1}}}
+\providecommand{\pkg}[1]{}
+\renewcommand{\pkg}[1]{{\color{red!80!black}\texttt{#1}}}
+\providecommand{\code}[1]{}
+\renewcommand{\code}[1]{{\color{blue!80!black}\texttt{#1}}}
+
+\providecommand{\file}[1]{}
+\renewcommand{\file}[1]{{\tt #1}}
+
+% This looks really hokey. Probably need to redefine this.
+\providecommand{\model}[2]{}
+\renewcommand{\model}[2]{{$\,$\hbox{#1}\ \ensuremath{\sim}\ \hbox{#2}}}
+
+% These should be considered deprecated -- cease and desist
+\providecommand{\VN}[1]{}
+\renewcommand{\VN}[1]{{\color{green!50!black}\texttt{#1}}}
+\providecommand{\vn}[1]{}
+\renewcommand{\vn}[1]{{\color{green!50!black}\texttt{#1}}}
+\providecommand{\DFN}[1]{}
+\renewcommand{\DFN}[1]{{\color{blue!80!black}\texttt{#1}}}
+\providecommand{\dfn}[1]{}
+\renewcommand{\dfn}[1]{{\color{blue!80!black}\texttt{#1}}}
+
diff --git a/include/problems.sty b/include/problems.sty
new file mode 100644
index 0000000..a8eef68
--- /dev/null
+++ b/include/problems.sty
@@ -0,0 +1,258 @@
+\NeedsTeXFormat{LaTeX2e}[1999/12/01]
+\ProvidesPackage{problems}
+  [2007/12/11 v0.1 problems package (R. Pruim (based on P.Pichaureau))]
+%% \CharacterTable
+%%  {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
+%%   Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z
+%%   Digits \0\1\2\3\4\5\6\7\8\9
+%%   Exclamation \!
Double quote \" Hash (number) \# +%% Dollar \$ Percent \% Ampersand \& +%% Acute accent \' Left paren \( Right paren \) +%% Asterisk \* Plus \+ Comma \, +%% Minus \- Point \. Solidus \/ +%% Colon \: Semicolon \; Less than \< +%% Equals \= Greater than \> Question mark \? +%% Commercial at \@ Left bracket \[ Backslash \\ +%% Right bracket \] Circumflex \^ Underscore \_ +%% Grave accent \` Left brace \{ Vertical bar \| +%% Right brace \} Tilde \~} +%% +\newif\if@AnswerOutput \@AnswerOutputtrue +\newif\if@AnswerDelay \@AnswerDelayfalse +\newif\if@ExerciseOutput \@ExerciseOutputtrue +\newif\if@ExerciseDelay \@ExerciseDelayfalse +\newif\if@AswLastExe \@AswLastExefalse +\newif\if@ShowLabel \@ShowLabelfalse +\newif\if@NumberInChapters \@NumberInChaptersfalse +\newif\if@NumberInSections \@NumberInSectionsfalse +\newif\if@DottedProbNumbers \@DottedProbNumbersfalse + +\DeclareOption{dotted} {\@DottedProbNumberstrue} +\DeclareOption{dotless} {\@DottedProbNumbersfalse} +\DeclareOption{noanswer} {\@AnswerOutputfalse} +\DeclareOption{answeronly} {\@ExerciseOutputfalse} +\DeclareOption{noexercise} {\@ExerciseOutputfalse} +\DeclareOption{exerciseonly} {\@AnswerOutputfalse} +\DeclareOption{outputnothing}{\@ExerciseOutputfalse\@AnswerOutputfalse} +\DeclareOption{exercisedelayed}{\@ExerciseDelaytrue} +\DeclareOption{answerdelayed}{\@AnswerDelaytrue} +\DeclareOption{lastexercise} {\@AswLastExetrue} +\DeclareOption{showlabel} {\@ShowLabeltrue} +\DeclareOption{chapter} {\@NumberInChapterstrue} +\DeclareOption{section} {\@NumberInSectionstrue} + +\ProcessOptions +\RequirePackage{keyval, ifthen} +\RequirePackage{xspace} + +\newbox\problemset@bin +\newbox\problem@bin +\newbox\solution@bin +\newbox\solutionset@bin +\newbox\studentsolution@bin +\newbox\studentsolutionset@bin +\global\setbox\problem@bin=\vbox{} +\global\setbox\problemset@bin=\vbox{} +\global\setbox\solution@bin=\vbox{} +\global\setbox\solutionset@bin=\vbox{} +\global\setbox\studentsolution@bin=\vbox{} 
+\global\setbox\studentsolutionset@bin=\vbox{} + +\def\renewcounter#1{% + \@ifundefined{c@#1} + {\@latex@error{counter #1 undefined}\@ehc}% + \relax + \let\@ifdefinable\@rc@ifdefinable + \@ifnextchar[{\@newctr{#1}}{}} + +\newcounter{problemNum} +\renewcommand{\theproblemNum}{\arabic{problemNum}} + +\if@NumberInSections + \renewcounter{problemNum}[section] + \renewcommand{\theproblemNum}{\thesection.\arabic{problemNum}}% +\fi + +\if@NumberInChapters + \renewcounter{problemNum}[chapter]% + \renewcommand{\theproblemNum}{\thechapter.\arabic{problemNum}}% +\fi + +\def\Rausskip{\ \vspace{-1\baselineskip}} +\def\Rausskip{\ \vspace{-.5\baselineskip}} + +\newenvironment{problem}% +{% +\refstepcounter{problemNum}% +%\begingroup% +\renewcommand{\labelenumi}{\textbf{\alph{enumi})}}% +\renewcommand{\labelenumii}{\roman{enumii}.}% +\renewcommand{\labelenumiii}{\Alph{enumiii}.}% +\renewcommand{\theenumi}{{\alph{enumi}}}% +\renewcommand{\theenumii}{\roman{enumii}}% +\renewcommand{\theenumiii}{\Alph{enumiii}}% +\global\setbox\problem@bin=\vbox\bgroup% +\noindent\textbf{\thechapter.\arabic{problemNum}.}% +}{% +\egroup% +\global\setbox\problemset@bin=\vbox{% +\unvbox\problemset@bin% + +\bigskip + +\unvbox\problem@bin% +%\endgroup% +} +}% + + + + +\newboolean{StudentSolution} +\newboolean{InstructorSolution} +\setboolean{StudentSolution}{false} +\setboolean{InstructorSolution}{true} +\newenvironment{solution}[1][\@empty]% +{% +% Do this by default +\setboolean{StudentSolution}{false} +\setboolean{InstructorSolution}{true} + +% Modify based on #1 +\ifthenelse{\equal{#1}{both}}{ + \setboolean{StudentSolution}{true} + \setboolean{InstructorSolution}{true}}% + {\relax} + +\ifthenelse{\equal{#1}{student}}{ + \setboolean{StudentSolution}{true} + \setboolean{InstructorSolution}{false}}% + {\relax} + +\ifthenelse{\equal{#1}{instructor}}{ + \setboolean{StudentSolution}{false} + \setboolean{InstructorSolution}{true}}% + {\relax} + +\renewcommand{\labelenumi}{\textbf{\alph{enumi})}}% 
+\renewcommand{\labelenumii}{\roman{enumii}.}% +\renewcommand{\labelenumiii}{\Alph{enumiii}.}% +\renewcommand{\theenumi}{{\alph{enumi}}}% +\renewcommand{\theenumii}{\roman{enumii}}% +\renewcommand{\theenumiii}{\Alph{enumiii}}% +\renewcommand{\labelenumii}{\textbf{\alph{enumii})}}% +\renewcommand{\labelenumiii}{\roman{enumiii}.}% +\renewcommand{\labelenumiv}{\Alph{enumiv}.}% +\renewcommand{\theenumii}{{\alph{enumii}}}% +\renewcommand{\theenumiii}{\roman{enumiii}}% +\renewcommand{\theenumiv}{\Alph{enumiv}}% +\global\setbox\solution@bin=\vbox\bgroup% +%\noindent\textbf{Solution \thechapter.\arabic{problemNum}. }% +%\begin{enumerate} +%\item[\textbf{\thechapter.\arabic{problemNum}.}]% +\noindent \textbf{\thechapter.\arabic{problemNum}. }% +}{% +%\end{enumerate} +\egroup% +% +% save to instructor solution set (if we should) +% +\ifthenelse{\boolean{InstructorSolution}}{% +\global\setbox\solutionset@bin=\vbox{% +\unvbox\solutionset@bin% + +\bigskip + +\unvcopy\solution@bin% +}}{\relax}% +% +% save to student solution set (if we should) +% +\ifthenelse{\boolean{StudentSolution}}{% +\global\setbox\studentsolutionset@bin=\vbox{% +\unvbox\studentsolutionset@bin% + +\medskip + +\unvbox\solution@bin% +} +}{\relax} +}% + +\newenvironment{studentsolution}[1][\@empty]% +{% +%\begingroup% +\def\paramOne{#1} +\renewcommand{\labelenumii}{\textbf{\alph{enumii})}}% +\renewcommand{\labelenumiii}{\roman{enumiii}.}% +\renewcommand{\labelenumiv}{\Alph{enumiv}.}% +\renewcommand{\theenumii}{{\alph{enumii}}}% +\renewcommand{\theenumiii}{\roman{enumiii}}% +\renewcommand{\theenumiv}{\Alph{enumiv}}% +\global\setbox\studentsolution@bin=\vbox\bgroup% +\begin{enumerate} +\item[\textbf{\thechapter.\arabic{problemNum}.}]% +}{% +\end{enumerate} +\egroup% +\global\setbox\studentsolutionset@bin=\vbox{% +\unvbox\studentsolutionset@bin% + +\medskip + +\unvbox\studentsolution@bin% +%\endgroup% +} +}% + +\newenvironment{bothsolution} +{% +\renewcommand{\labelenumii}{\textbf{\alph{enumii})}}% 
+\renewcommand{\labelenumiii}{\roman{enumiii}.}%
+\renewcommand{\labelenumiv}{\Alph{enumiv}.}%
+\renewcommand{\theenumii}{{\alph{enumii}}}%
+\renewcommand{\theenumiii}{\roman{enumiii}}%
+\renewcommand{\theenumiv}{\Alph{enumiv}}%
+\global\setbox\solution@bin=\vbox\bgroup%
+\begin{enumerate}
+\item[\textbf{\thechapter.\arabic{problemNum}.}]%
+}{%
+\end{enumerate}
+\egroup%
+\global\setbox\solutionset@bin=\vbox{%
+\unvbox\solutionset@bin%
+
+\bigskip
+
+\unvcopy\solution@bin%
+}
+\global\setbox\studentsolutionset@bin=\vbox{%
+\unvbox\studentsolutionset@bin%
+
+\medskip
+
+\unvbox\solution@bin%
+}
+}%
+
+\def\shipoutProblems{%
+%\begin{xcb}{Exercises}
+\unvbox\problemset@bin
+\unvbox\problem@bin
+%\end{xcb}
+}
+
+\def\shipoutSolutions{%
+\unvbox\solutionset@bin
+\newpage
+}
+
+\def\shipoutStudentSolutions{%
+\unvbox\studentsolutionset@bin
+\newpage
+}
+
+\endinput
+
+
+
diff --git a/include/probstat.sty b/include/probstat.sty
new file mode 100644
index 0000000..9231ad5
--- /dev/null
+++ b/include/probstat.sty
@@ -0,0 +1,350 @@
+
+\ProvidesPackage{probstat}
+\RequirePackage{amsmath}
+\RequirePackage{ifthen}
+\RequirePackage{amsmath}
+\RequirePackage{bm}
+\RequirePackage{xcolor}
+\RequirePackage{fancyvrb}
+
+\newboolean{longExp}
+\setboolean{longExp}{false}
+\DeclareOption{longExp}{\setboolean{longExp}{true}}
+\DeclareOption{shortExp}{\setboolean{longExp}{false}}
+
+\ProcessOptions*
+
+\newcommand{\term}[1]{\textbf{#1}}
+\newcommand{\code}[1]{{\tt #1}}
+\newcommand{\file}[2][R]{{\tt #2}}
+\newcommand{\command}[1]{\texttt{#1}}
+%\newcommand{\R}{\mbox{\texttt{R}}}
+\newcommand{\R}{\mbox{\sf{R}}}
+
+\newlength{\cwidth}
+\newcommand{\cents}{\settowidth{\cwidth}{c}%
+\divide\cwidth by2
+\advance\cwidth by-.1pt
+c\kern-\cwidth
+\vrule width .1pt depth.2ex height1.2ex
+\kern\cwidth}
+
+\def\myRuleColor{\color{blue!45!white}}
+\colorlet{myRuleColor}{blue!45!white}
+\def\includeR#1{%
+\typeout{Including R output from #1}
+\VerbatimInput[framerule=.5mm,
+               frame=leftline,
+               rulecolor=\myRuleColor,
+               fontsize=\small]{#1}
+}
+\def\includeRtiny#1{%
+\typeout{Including R output from #1}
+\VerbatimInput[framerule=.5mm,
+               frame=leftline,
+               rulecolor=\myRuleColor,
+               fontsize=\footnotesize]{#1}
+}
+
+
+\DefineVerbatimEnvironment%
+{Rcode}{Verbatim}
+{framerule=.5mm,frame=leftline,rulecolor=\myRuleColor,fontsize=\small}
+
+\DefineVerbatimEnvironment%
+{tinyRcode}{Verbatim}
+{framerule=.5mm,frame=leftline,rulecolor=\myRuleColor,fontsize=\tiny}
+
+\DefineVerbatimEnvironment%
+{footRcode}{Verbatim}
+{framerule=.5mm,frame=leftline,rulecolor=\myRuleColor,fontsize=\footnotesize}
+
+\def\includeRaus#1{%
+
+\hfill \makebox[0pt]{\fbox{\tiny #1}}
+\vspace*{-3ex}
+
+%\xmarginpar{\fbox{\tiny #1}}%
+\includeR{Rout/#1.Raus}
+%\hfill
+%\rule{1in}{.3pt}
+%\rule{1in}{.3pt}
+%\hfill
+}
+\def\includeRchunk#1{%
+
+\hfill \makebox[0pt]{\fbox{\tiny #1}}
+\vspace*{-3ex}
+
+%\xmarginpar{\fbox{\tiny #1}}%
+\includeR{Rchunk/#1.Rchunk}
+%\hfill
+%\rule{1in}{.3pt}
+%\rule{1in}{.3pt}
+%\hfill
+}
+
+\def\includeRausTwo#1{%
+
+\hfill \makebox[0pt]{\fbox{\tiny #1}}
+\vspace*{-3ex}
+
+\begin{multicols}{2}
+\includeR{Rout/#1.Raus}
+\end{multicols}
+}
+
+
+%% basic probability stuff
+\newcommand{\E}{\operatorname{E}}
+\newcommand{\Prob}{\operatorname{P}}
+\def\evProb#1{\Prob(\mbox{#1})}
+\newcommand{\Var}{\operatorname{Var}}
+\newcommand{\coVar}{\operatorname{Cov}}
+\newcommand{\Cov}{\operatorname{Cov}}
+\newcommand{\covar}{\operatorname{Cov}}
+\newcommand{\argmax}{\operatorname{argmax}}
+\newcommand{\argmin}{\operatorname{argmin}}
+
+\newcommand\simiid{\stackrel{\tiny \operatorname{iid}}{\sim}}
+
+\newcommand{\distribution}[1]{{\textsf{#1}}}
+\gdef\Bin{\distribution{Binom}}
+\gdef\Binom{\distribution{Binom}}
+\gdef\Multinom{\distribution{Multinom}}
+\gdef\NBinom{\distribution{NBinom}}
+\gdef\Geom{\distribution{Geom}}
+\gdef\Norm{\distribution{Norm}}
+\gdef\Hyper{\distribution{Hyper}}
+\gdef\Unif{\distribution{Unif}}
+\ifthenelse{\boolean{longExp}}{%
+  \gdef\Exponential{\distribution{Exp}}%
+  }{%
+  \gdef\Exp{\distribution{Exp}}%
+  }
+\gdef\Poisson{\distribution{Pois}}
+\gdef\Pois{\distribution{Pois}}
+\gdef\Gam{\distribution{Gamma}}
+\gdef\Gamm{\distribution{Gamma}}
+\gdef\Beta{\distribution{Beta}}
+\gdef\Weibull{\distribution{Weibull}}
+\gdef\Chisq{\distribution{Chisq}}
+\gdef\Tdist{\distribution{T}}
+\gdef\Fdist{\distribution{F}}
+
+
+\def\mean#1{\overline{#1}}
+
+\def\Prob{\operatorname{P}}
+\ifthenelse{\boolean{longExp}}{%
+  \def\Exp{\operatorname{E}}
+  }{%
+  \def\E{\operatorname{E}}
+  }
+\def\Var{\operatorname{Var}}
+\def\SD{\operatorname{SD}}
+
+
+%% ANOVA abbreviations
+\def\SE{SE}
+\def\SSe{SSE}
+\def\SSTot{SSTot}
+\def\SSr{SSM}
+\def\SSM{SSM}
+\def\SSx{SS_x}
+\def\SSy{SS_y}
+\def\Sxy{S_{xy}}
+\def\Sxx{S_{xx}}
+\def\Syy{S_{yy}}
+
+
+
+%% some colors
+\colorlet{trCol}{green!50!black}
+\colorlet{erCol}{red!70!black}
+\colorlet{meanCol}{orange!90!black}
+\colorlet{adjCol}{blue!80!black}
+\colorlet{fitCol}{purple}
+
+%% some vector stuff
+
+\newcommand{\rowvec}[1]{{\left[ #1 \right]}}
+\newcommand{\transpose}[1]{{#1}^T}
+\newcommand{\colvec}[1]{\transpose{\rowvec{#1}}}
+\def\vec#1{\bm{#1}}
+\def\mat#1{\bm{#1}}
+\newcommand{\vecarray}[2][black]{
+\textcolor{#1}{
+\left[
+\begin{array}{r}
+  #2
+\end{array}
+\right]
+}
+}
+
+\newenvironment{brackmat}%
+{%
+\left[
+\begin{matrix}
+}{%
+\end{matrix}
+\right]
+}
+
+%\newcommand{\D}[2]{\frac{\partial}{\partial #2}#1}
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% some hacks and kludges
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\def\lhd{\mathrel{\vartriangleleft}}
+\def\unlhd{\mathrel{\trianglelefteq}}
+\def\Box{\mathrel{\square}}
+\def\QED{\hfill\mbox{$\Box$}}
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% macros from schoening
+% with modifications
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%% relations
+
+\def\wiggle{\sim}
+\def\wiggles{\wiggle}
+\def\approxwiggle{\stackrel{\cdot}{\wiggle}}
+
+%% sets of numbers
+
+\def\WholeNumbers{\mbox{$\mathbb W$}}
+\def\WholeNums{\mbox{$\mathbb W$}}
+\def\Naturals{\mbox{$\mathbb N$}}
+\def\NatNums{\mbox{$\mathbb N$}}
+\def\natNums{\mbox{$\mathbb N$}}
+\def\natNumbers{\mbox{$\mathbb N$}}
+\def\NatNumbers{\mbox{$\mathbb N$}}
+\def\Reals{\mbox{$\mathbb R$}}
+\def\reals{\mbox{$\mathbb R$}}
+\def\Jset{\mbox{$\mathbb J$}}
+\def\Integers{\mbox{$\mathbb Z$}}
+\def\integers{\mbox{$\mathbb Z$}}
+\def\rationals{\mbox{$\mathbb Q$}}
+\def\Rationals{\mbox{$\mathbb Q$}}
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% misc
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\def\dom#1{\mbox{dom}\,#1}
+
+\def\seq#1#2{\{#1_{#2}\}_{#2=1}^{\infty}}
+\def\seqg#1#2#3{\{#1\}_{#2=#3}^{\infty}}
+
+
+\gdef\makemath#1{\ifmmode #1 \else $ #1 $\fi}
+\def\ignore#1{\relax}
+\def\ds{\displaystyle}
+
+\def\varp{\varphi}
+
+%\def\options#1#2{ {\tt [ #1 ]/[ #2 ]} }
+\def\abs#1{{\mid #1 \mid}}
+
+% this interferes with the definition in amstheorem and ntheorem
+%\def\qed{$\Box$}
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% 'function-like' defs
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\def\floor#1{\lfloor #1 \rfloor}
+\def\ceiling#1{\lceil #1 \rceil}
+\def\falling#1#2{#1^{\underline{#2}}}
+\def\rising#1#2{#1^{\overline{#2}}}
+\def\pair#1{\langle #1 \rangle }
+\def\Pair#1{\left\langle #1 \right\rangle }
+\def\tuple#1{{\langle {#1} \rangle}}
+\def\length#1{\vert #1 \vert}
+\def\boolval#1{\lbrack\!\lbrack #1 \rbrack\!\rbrack}
+\def\set#1{\{#1\}}
+\def\ctblset#1#2{\set{ #1_{#2} \mid #2 \in \omega }}
+\def\card#1{\vert #1 \vert}
+\def\size#1{\vert #1 \vert}
+\def\norm#1{\| #1 \|}
+\def\ket#1{{|{#1} \rangle}}
+\def\bra#1{{\langle {#1}|}}
+\def\braket#1#2{{\langle {#1} \mid {#2} \rangle}}
+%\newcommand{\complement}[1]{\makemath{#1^{c}}}
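+% Usage sketch (editorial comment, not part of the original probstat.sty):
+% the 'function-like' defs above are math-mode helpers, e.g.
+%   $\floor{x} \le x \le \ceiling{x}$
+%   $\card{\set{1,2,3}} = 3$
+%   $\braket{\psi}{\varphi}$ for an inner product.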
+\newcommand{\comp}[1]{{#1}^{c}}
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% symbols for manipulating strings, sets and functions
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\def\garthCases#1{\left\{\mbox{\begin{tabular}{@{$}l@{$\qquad}l} #1 \end{tabular}}\right.}
+
+%for example:
+%
+%$$ A = \garthCases{ \sigma & if something,\\ \sigma' & otherwise. } $$
+%
+% -- garth
+
+\def\powerset{\mbox{$\cal P$}}
+\def\powerSet{\mbox{$\cal P$}}
+\def\EmptySet{\emptyset}
+\def\emptySet{\emptyset}
+\def\emptystring{\lambda}
+\def\emptyString{\emptystring}
+\def\concat{^\frown}
+\def\substring{\sqsubset}
+\def\supstring{\sqsupset}
+\def\substringeq{\sqsubseteq}
+\def\supstringeq{\sqsupseteq}
+%\def\substringnoteq{{\sqsubset \atop \not=}}
+%\def\supstringnoteq{{\sqsupset \atop \not=}}
+%\def\subsetnoteq{{\subset \atop \not=}}
+%\def\supsetnoteq{{\supset \atop \not=}}
+\def\substringnoteq{\substring}
+\def\supstringnoteq{\supstring}
+\def\subsetnoteq{\subset}
+\def\supsetnoteq{\supset}
+
+\def\intersect{\cap}
+\def\Intersect{\bigcap}
+\def\union{\cup}
+\def\Union{\bigcup}
+\def\symdif{\bigtriangleup}
+\def\setminus{-}
+
+\def\compose{\circ}
+\def\restricted{\makemath{|\!\grave{\;}}}
+\def\restrictedto{{|\!\grave{\;}}}
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% logic stuff
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\def\proves{\vdash}
+\def\provedby{\dashv}
+
+\def\implies{\Longrightarrow}
+
+\def\succ{{\rm succ}}
+\def\divg{{\!\uparrow}}
+\def\conv{{\!\downarrow}}
+\def\domain{{\rm domain}}
+\def\range{{\rm range}}
+
+\def\setof#1{{\left\{{#1}\right\}}}
+
+\newcommand{\tand}{\mbox{\ and\ }}
+\newcommand{\tor}{\mbox{\ or\ }}
+
diff --git a/include/setup.R b/include/setup.R
index 9eba420..ab01749 100644
--- a/include/setup.R
+++ b/include/setup.R
@@ -6,13 +6,13 @@ require(datasets)
 require(stats)
 require(lattice)
 require(grid)
-# require(fastR) # commented out by NJH on 7/12/2012
 require(mosaic)
+require(mosaicData)
 trellis.par.set(theme=col.mosaic(bw=FALSE))
 trellis.par.set(fontsize=list(text=9))
 options(keep.blank.line=FALSE)
 options(width=60)
-require(vcd)
+# require(vcd) # went away 11/5/2015 by njh
 require(knitr)
 opts_chunk$set( tidy=TRUE, size='small',