<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>
            Hillary Sanders
        </title>
        <meta name="author" content="Hillary Sanders">
        <meta name="keywords" content="Hillary Sanders Statistics Berkeley Premise Data Scientist Art Bayesian Painting">
        <meta name="description" content="Hillary Sanders: Data Scientist?">
        <meta name="robots" content="index, follow, noarchive">

      <link type="text/css" rel="stylesheet" href="css/reset.css"/>
      <link type="text/css" rel="stylesheet" href="css/mainstylesheet.css"/>
      <style>
         .midwrap {
            min-height: 1200px;
         }
      </style>
   </head>

   <body>

      <div class="midwrap">
         <div class="header codex basic"><h1>hillary sanders</h1></div>

         <div class="navbox">
            <ul class="navlist">
               <a href="index.html"><li class="navitem basic">about hillz</li></a>
               <a href="drawings.html"><li class="navitem basic">drawings</li></a>
               <a href="paintings.html"><li class="navitem basic">paintings</li></a>
               <a href="mixed-media.html"><li class="navitem basic">mixed media</li></a>
               <a href="cheatsheets.html"><li class="navitem basic">cheatsheets</li></a>
               <a href="code.html"><li class="navitem basic">code</li></a>
               <a href="graphs.html"><li class="navitem basic">graphs</li></a>
               <a href="resume.html"><li class="navitem basic">resume</li></a>
            </ul>
         </div>

         <div class="threecol thick">
            <h2> The coolest stuff </h2>
          <p> Well damn, it's confidential, sorry.
        </p>


          <h2> Homegrown Random Forests </h2>
        <p>
          This was part of a fun machine learning project I worked on with two other students from Berkeley. Basically, we tried to predict baseball players' success from their past stats. We implemented a plethora of classical machine learning methods, using only base-R code (no packages to do your black box work for you!). I wrote a random forest learner from scratch. Code can be found
          <a href="pics/random_forest.R" title="Random Forest Code"> <font color="#001070"> here </font> </a> and <a href="pics/random_forest_script.R"> <font color="#001070">
          here </font> </a>.
        </p>


          <h2> Creating an R Package: Legislative Text Mapping and Directed Acyclic Graphs </h2>
          <p> From 2011 to 2012 I worked with the fantastic Mark Huberty, a graduate student in the Travers Department of Political Science at UC Berkeley (who is awesome and is basically the reason I got interested in doing what I'm now happily doing with my life. Props.) I helped develop an original R package to map the evolution of legislation from its introduction as a bill, through the amendment process, until finalization. There are five core pieces of functionality in the Leghist package: </p>
          <p> 1. Mapping of amendments to their locations in legislative bills. <br>
          2. Identification of discarded material. <br>
          3. Mapping of sections between editions of bills. <br>
          4. Modeling of the content of added and discarded material. <br>
          5. Visualization of the flow of bill content, by subject matter.
            </p>
          <p> Although I was somewhat involved in all parts of the above, I wrote the code for 5). I created two master functions for the package. Both took raw output from 1) through 4) and created customizable, yet automated, directed acyclic graphs, implemented through the R package igraph. I also wrote automated scripts to test these functions.
          </p>
          <p>
          I also worked on figuring out how to document (using Roxygen2), test, and finally build the R package in an automated fashion.
          </p>

         </div>

         <div class="threecol thick">
            <h2> Using Tweets and Bayesian Statistical Analysis to Model the 2012 Presidential Election </h2>
          <p> <a href="images/twitter_paper/midterm.pdf" title="Fun With Twitter!"> <font color="#404043"> This was one of those class projects that was just way too much fun. I wrote a paper describing this project, which this text links to. Basically, I created a forecasting model that predicts state-level vote-share probabilities, using a hierarchical Bayesian model to incorporate simple text analysis of state-specific tweets into the predictions. The model used Markov chain Monte Carlo methods to develop the final posterior distribution. Model priors were based on state-level 2004 and 2008 vote-share data. Data consisted of recent tweets mentioning 'Obama' or 'Romney'. Although the simple text analysis of tweets is not a perfect substitute for polling data (the paper discusses the problems), it offers a potential way to bolster political forecasting models. </font> </a>
          </p>


          <h2> Voting with your Tweet: Forecasting elections with social media data; Broadcasting live predictions online</h2>
          <p> This project broadcast live, out-of-sample congressional election predictions based on Mark Huberty's SuperLearner-based algorithm, which takes tweets as input. I helped Mark (who is awesome and taught me nearly everything) by cleaning up code, writing a bit myself, and gathering congressional candidate data.
          </p>


          <h2> Automated Pulling, XML Parsing, and Visualization of World Bank Country-wise Economic Indicators </h2>
          <p> This was actually for a school project. I worked with a good group of kids, and we all took our own chunks of the intended project and just sort of ran with them. My goal was to completely automate, and make easily adjustable, the pulling of World Bank data, and to do the same for some really awesome-looking, and informative (!), graphs. I would show you the pretty pictures, but I left my laptop out in the rain, so...
          </p>
         </div>


         <div class="threecol thick">

            <h2> Automated Visualizations of Medicare Data </h2>
            <p> This is part of what I worked on for my summer (2012) internship with Acumen LLC. I wrote various flexible and adjustable R functions that take Excel workbooks as input; each function parses, organizes, and plots the data in some unique way.
            </p>
            <p> One function I wrote plots normalized values of multiple variables over all districts or all states in the United States with a segmented scatter plot. A user-chosen state is highlighted and its values shown. All points above a user-chosen number of standard deviations from the mean become two-letter state abbreviations or three-letter district abbreviations. A wrapper function I made enables users to save all 51 (state) or 90 (district) graphs into a single file in an automated fashion, in various formats (jpeg, pdf). The graphs are highly customizable: they can take a flexible number of table rows and columns, and colors, line thickness, labels, scatter segmentation, etc. are adjustable.
            </p>
            <p> Another function I made allows viewers to spot the professional relationships among (often many thousands of) doctors. First, I created a base distance metric to represent professional closeness between two doctors Di and Dj. This involved variables like the number of beneficiaries Di and Dj share, weighted by billing; the percentage of beneficiaries Di and Dj share relative to their own unique beneficiary service counts, weighted by billing; etc.
            </p>
            <p> This function uses the R igraph package. It takes as input information regarding many thousands of doctors and plots their relationships, using a user-chosen algorithm. Users are able to modify the way the doctor relationship metrics (input to the relationship strength algorithms) are calculated. Users can also choose up to two percentile relationship cutoffs to modify how doctor nodes and relationships are colored and sized (e.g. all relationships with a percentile rating above 99.9% can be enlarged and colored red).
            </p>
         </div>


      </div>

   </body>
</html>