Assignment #1
STAT 850
Fall 2016
Complete the problems below. Within each part, include your SAS program code, all corresponding output, and any additional information needed to explain your answer. Your SAS code and output should be formatted in a manner similar to the lecture notes.
(35 total points) The National Football League (NFL) holds a scouting combine every year for college football players who would like to play football in the NFL. These players go through a number of evaluations during the combine so that NFL teams can assess their ability. For more information, please see http://en.wikipedia.org/wiki/NFL_Scouting_Combine and http://www.nfl.com/combine.
The nfl_combine_2014_noNA.csv data file is available from my course website, and it contains information on some of the players who participated in the 2014 combine. The columns in the data file represent the following information:
Player: Name of player being evaluated
College: College that the player attended
Position: The position of the player, where DB = defensive back, LB = linebacker, OL = offensive lineman, RB = running back, S = safety, TE = tight end, WO = wide receiver; players who played other positions were excluded from the data file
OverallGrade: The overall grade of the player based on the evaluations
Height: Height in inches
ArmLength: Arm length in inches
Weight: Weight in pounds
Dash40: 40-yard dash time in seconds
BenchPress: Number of bench press repetitions of 225 pounds
VerticalJump: Vertical jump in inches
BroadJump: Broad jump in inches
Cone3Drill: 3-cone drill time in seconds
Shuttle20: 20-yard shuttle run time in seconds
Using these data, complete the problems below. While you are welcome to use any football knowledge to help answer the questions, it is not needed to perform well on this assignment.
(4 points) Read the data into SAS using proc import and print the first five observations using proc print.
(4 points) Sort the players by their 40-yard dash times. Print only the name of each offensive lineman along with his 40-yard dash time.
(4 points) Find the mean 40-yard dash time for the players at each position. Which position has the fastest players on average? Which position has the slowest players on average?
Using proc means, find the mean, standard deviation, and sample size for the 40-yard dash times of offensive linemen and of wide receivers (each separately). Export these values into a data set in the following ways:
(3 points) output statement
(3 points) ods statement
(5 points) Assuming these players are a simple random sample from a population of all players, we can perform statistical inference procedures to make inferences about this population. With this assumption, perform a two-sample t-test with unequal variances to test the equality of means for the 40-yard dash times of offensive linemen and wide receivers. More formally, we can state the hypotheses as
H_{0}: \mu_{OL} - \mu_{WR} = 0
H_{a}: \mu_{OL} - \mu_{WR} \neq 0
where \mu_{position} denotes the mean for a particular position. This hypothesis test should be performed by showing the correct test statistic and p-value equations with their values AND without using a SAS procedure to automatically find these values. Your write-up here should be formal, including statements of the hypotheses, test statistic, p-value, critical value, and decision with reasoning.
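For reference, a two-sample t-test with unequal variances is commonly written in the Welch form, with the Satterthwaite approximation for the degrees of freedom. The sketch below assumes that form. The symbols \bar{y}, s, n, \nu, T_{\nu}, and \alpha are notation introduced here for illustration; the assignment itself does not specify a significance level \alpha.

```latex
% Welch test statistic (sample means \bar{y}, sample standard deviations s, sample sizes n)
t = \frac{\bar{y}_{OL} - \bar{y}_{WR}}
         {\sqrt{\dfrac{s_{OL}^{2}}{n_{OL}} + \dfrac{s_{WR}^{2}}{n_{WR}}}}

% Satterthwaite approximation to the degrees of freedom
\nu = \frac{\left( \dfrac{s_{OL}^{2}}{n_{OL}} + \dfrac{s_{WR}^{2}}{n_{WR}} \right)^{2}}
           {\dfrac{\left( s_{OL}^{2}/n_{OL} \right)^{2}}{n_{OL} - 1}
          + \dfrac{\left( s_{WR}^{2}/n_{WR} \right)^{2}}{n_{WR} - 1}}

% Two-sided p-value, where T_{\nu} is a t random variable with \nu degrees of freedom
p\text{-value} = 2 \, P\left( T_{\nu} > |t| \right)
```

Under this form, H_{0} is rejected when |t| exceeds the critical value t_{1-\alpha/2, \nu}, or equivalently when the p-value is less than \alpha.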
This problem involves a new procedure, proc ttest, to perform the same types of calculations as in part e.
(3 points) Show the main syntax help page available in SAS for this procedure. A screen capture will suffice to obtain full credit.
(3 points) Perform the calculations for the test using proc ttest. Indicate where the key components of the output are that allow one to perform the test.
(3 points) Use ods trace to determine the name of the table that contains the p-value for the test.
(3 points) Use the ods statement to create a data set with the p-value for the test. Print this data set.
(20 total points) “Stability testing” is performed by pharmaceutical companies to determine the shelf life for drug products. Typically, part of a drug batch (like a number of pills) is put into storage in a controlled temperature and humidity environment. At regular time points, an item is taken out of storage and testing is performed on it. A common response measured on each item is potency. Over time, the potency of a drug will usually degrade, so the Food and Drug Administration (FDA) has set a lower limit of 95% of the desired potency level, above which the drug needs to remain. The exact time point where the drug goes below this limit is the shelf life. This shelf life (say, 4 months) is then added to the manufacturing date of a drug to find the expiration date, which is what consumers often see printed on drug packaging.
The shelf life is found with the help of regression models. To show how this is done, below is a simulated data set where the potency of a drug has been measured over time in months. Suppose a single pill has been measured at each time point.
Time    Potency
3       1.0155450
6       0.9835495
9       0.9957994
12      0.9836627
15      0.9863230
18      0.9945146
21      0.9995710
24      0.9679062
30      0.9690051
36      0.9891509
48      0.9674187
60      0.9557498

For example, the pill taken out of storage at time 3 months had a potency of 101.55% of the desired potency level. Using these data, complete the following problems.
(4 points) Use a data step with the datalines statement to create a SAS data set containing the data in the previous table. Print the data set using proc print.
(5 points) Estimate and state the sample regression model with time as the explanatory variable and potency as the response variable. Use proc reg to perform the estimation and make sure that no plots are produced by the procedure. Interpret the relationship between time and potency as given by the model.
(4 points) Is there sufficient evidence to indicate a linear relationship between time and potency? Use the appropriate statistical inference methods to make this judgment.
(4 points) Use proc reg again as in part b, but include the plot with 95% confidence interval bands for the expected potency. No other plots should be included in the output! I recommend using the SAS help to determine the correct coding specification.
(3 points) The FDA has guidelines to determine the shelf life of a drug. Specifically, a 95% confidence interval band plot (like in part d) is used to find the time where the lower band intersects a horizontal line drawn at a 95% potency level. The time point where this occurs is the shelf life. Using the plot in part d, approximate what the shelf life would be for these data. Note that you do not need to use SAS to draw the line at a 95% potency level.
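The intersection described here can also be written as an equation. The sketch below assumes the band is the usual pointwise two-sided 95% confidence interval for the mean response in a simple linear regression. The symbols b_{0}, b_{1}, MSE, \bar{x}, and L(x) are notation introduced here for illustration, with x denoting time in months and n = 12 observations in the potency table.

```latex
% Lower 95% confidence band for the expected potency at time x
L(x) = b_{0} + b_{1} x
       - t_{0.975,\, n-2}
         \sqrt{ MSE \left( \frac{1}{n}
              + \frac{(x - \bar{x})^{2}}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} \right) }

% Shelf life: the earliest time at which the lower band reaches 95% potency
\text{shelf life} \approx \min \{\, x > 0 : L(x) = 0.95 \,\}
```

Reading the crossing point off the plot gives an approximation to the solution of the second expression.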
