Title: | GC Statistics Datasets |
---|---|
Description: | A small, informal collection of datasets useful in undergraduate statistics courses. |
Authors: | Homer White |
Maintainer: | Homer White |
License: | GPL (>=3) |
Version: | 0.2.1 |
Built: | 2024-11-24 22:20:53 UTC |
Source: | https://github.com/homerhanumat/tigerData |
2017 version of data set for textbook Modern Data Science With R (see Chapter 9).
A data frame with 1103 observations on the following 7 variables.
manufacturer of car
car model
Engine Displacement (Liters)
number of cylinders
fuel economy (mpg), city
fuel economy (mpg), highway
number of gears
Subset of survey data collected by the US National Center for Health Statistics (NCHS). The original data was based on home interviews of about 5,000 people per year, from 1999 to 2004.
A data frame with 9096 observations on the following 23 variables.
sex
"male" or "female"
age
age of subject in years
pregnant
"yes" or "no"
ethnicity
Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, or Other/Multi
smoker
"yes" or "no"
diabetic
"yes" or "no"
height
height (meters)
weight
weight (kilograms)
waist
waist circumference (meters)
wci
the proposed body shape index
bmi
body mass index
ptfat
percent trunk fat
tfat
mass of trunk fat
lfat
limb fat
llean
limb lean tissue
lbmi
lean-tissue only BMI
fbmi
fat-only BMI
bbmi
bone BMI
pfat
percent fat
bmd
bone mineral density
fmhm_other
Framingham risk score
hdl
HDL cholesterol
chol
cholesterol (LDL?)
bps
systolic blood pressure, mmHg
bpd
diastolic blood pressure, mmHg
income
ratio of family income to poverty threshold. (5 stands for a ratio greater than or equal to 5)
Modified from NCHS in package DataComputing.
The original data is from NHANES, the National Health and Nutrition Examination Survey.
See http://wwwn.cdc.gov/nchs/nhanes/search/nhanes03_04.aspx# for more information.
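The variable list above gives bmi without a formula and notes that income is capped at 5. As a minimal Python sketch, assuming bmi follows the standard definition (weight in kilograms divided by squared height in meters) and that the cap is a simple top-code, the two rules look like this. The function names are hypothetical and are not part of the package.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Standard BMI: weight (kg) divided by squared height (m)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def top_code_income_ratio(ratio: float, cap: float = 5.0) -> float:
    """Report any income-to-poverty ratio at or above the cap as the cap."""
    return min(ratio, cap)


# Illustrative values only, not taken from the dataset.
print(round(body_mass_index(70.0, 1.75), 2))
print(top_code_income_ratio(7.2))
```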
Donations made to a fictional Zen Center.
In a family with participant, event and eventParticipation.
A data frame with 66 observations on the following 5 variables.
Donation ID
Participant ID
Amount of donation, in dollars.
Date of donation.
A character vector with two values: cash and check.
Hypothetical data.
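Because each donation carries a participant ID, the donation table can be linked to the participant table. The Python sketch below illustrates that kind of join on a few made-up rows. The column names (participant_id, first_name, amount, method) are assumptions for illustration, since the actual variable names are not shown here.

```python
from collections import defaultdict

# Hypothetical rows; column names are illustrative assumptions.
participants = [
    {"participant_id": 1, "first_name": "Ada"},
    {"participant_id": 2, "first_name": "Ben"},
]
donations = [
    {"donation_id": 10, "participant_id": 1, "amount": 25.0, "method": "cash"},
    {"donation_id": 11, "participant_id": 1, "amount": 40.0, "method": "check"},
    {"donation_id": 12, "participant_id": 2, "amount": 15.0, "method": "cash"},
]


def total_donated(participants, donations):
    """Sum donation amounts per participant, keyed by first name."""
    totals = defaultdict(float)
    for donation in donations:
        totals[donation["participant_id"]] += donation["amount"]
    # Participants with no donations get a total of 0.0.
    return {p["first_name"]: totals[p["participant_id"]] for p in participants}


print(total_donated(participants, donations))
```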
Events at a fictional Zen Center.
In a family with participant, donation and eventParticipation.
A data frame with 15 observations on the following 7 variables.
Event ID
Type of event. A factor with levels potluck, retreat and sangha.
Location of the event.
Start-time of the event.
Ending time of the event.
Nominal cost of the event, in dollars.
ID of the participant who organizes the event.
Hypothetical data.
Participation of persons in events associated with a fictional Zen Center.
In a family with participant, donation and event.
A data frame with 107 observations on the following 5 variables.
Participant ID
ID of the participant.
Amount actually paid by the participant.
Miscellaneous comments
Whether or not the participant actually attended.
Hypothetical data.
Modified from a dataset obtained in the course of a study on factors that are associated with fire-setting among at-risk youth. The data comes from national surveys of at-risk teenagers.
A data frame with 975 observations on the following 6 variables.
Child's age in years.
Sex of the child.
Child's race.
A measure of child's perceptions about school, combined from surveys given to child and to his/her parents. Higher scores indicate poorer attitudes.
A measure of the child's academic performance. Higher scores indicate more academic problems.
Scaled scores from a test for ADHD. Higher scores indicate more problems with ADHD.
Whether or not the child sets fires (0 = does not, 1 = does).
Doctoral dissertation by Carrie H. Bowling, University of Kentucky, 2013. Further details in ../doc/firesetting_phd_proposal.pdf.
Subset of data from a study on sea snails of the genus Haliotis. The age of such a snail (in years) is quite close to the number of rings it has on its shell plus 1.5. The idea is to predict the age of an individual sea snail from other characteristics. The rest of the data is withheld for evaluation purposes.
A data frame with 2923 observations on the following 9 variables.
male, female or infant
length of longest shell
diameter
height
weight when whole
weight when shucked
weight of the snail's viscera
weight of the snail's shell
number of rings, plus 1.5
From the UCI website: "The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope – a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age."
Dataset derived from the UCI Machine Learning Repository. See http://archive.ics.uci.edu/ml/datasets/Abalone for more information, including a citation of the original research article.
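The age rule stated in the description (rings plus 1.5) is simple enough to write down directly. A minimal Python sketch follows. The function name and sample ring counts are hypothetical.

```python
def age_from_rings(rings: int) -> float:
    """Approximate age in years: number of shell rings plus 1.5."""
    if rings < 0:
        raise ValueError("ring count cannot be negative")
    return rings + 1.5


# Illustrative ring counts, not taken from the dataset.
for rings in (7, 10, 15):
    print(rings, age_from_rings(rings))
```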
Georgetown College students surveyed their peers on attitudes about LBGT issues and persons.
A data frame with 75 observations on the following 22 variables.
From Hunter Gatewood (email): "It was coded mainly based upon the numbers stated within the questions. For question 7 (on the front), if they answered no for question 6 (front), we coded it as 5. For question 11, we had to separate it between 'for a short time' (1-3 semesters) and 'for a longer time' (4+ semesters) with coding (1,2)."
Hunter Gatewood, Molly Dixon, Almond Bailey. Class: PSY 311, Fall 2014. Instructor: Dr. Regan Lookadoo. For survey form see ../doc/lbgt_likert.pdf.
Results of the Survey as of 2018.
A data frame with 9842 observations on the following 18 variables.
id
An ID number for the subject
gender
"male"
or "female"
age
age of subject in years
arthritis
Was the subject ever told he/she had arthritis? (yes, no)
edu
level of education attained
married
marital status
income
income level
cholesterol
blood cholesterol (mmol/L)
glucose
glucose, refrigerated serum (mmol/L)
iron
iron (umol/L)
sodium
sodium (mmol/L)
weight
weight (kg)
systolic
systolic blood pressure, mmHg
diastolic
diastolic blood pressure, mmHg
asthma
Does the subject have asthma? (yes, no)
heartattack
Was the subject ever told he/she had a heart attack? (yes, no)
liver
Was the subject ever told he/she had a liver condition? (yes, no)
cancer
Was the subject ever told he/she had a cancer or malignancy? (yes, no)
NHANES, the National Health and Nutrition Examination Survey. See http://wwwn.cdc.gov/nchs/nhanes/search/nhanes03_04.aspx# for more information.
A subset of the parking data frame, giving only the subjects involved in the experiment. In the experiment, parked cars were approached by either an expensive car or a cheap one. The approaching car waited for the spot, and while waiting either honked once or did not honk at all.
A data frame with 237 observations on the following 12 variables.
The type of car that was waiting for the parking spot (or that just drove by). Either a Nissan Maxima or an Infinity Q45. The car is "confronting" the parked car, hence the name of the variable.
Sex of the driver of the parked car.
Race of the driver of the parked car.
Number of people in the parked car (including the driver).
The waiting car either honked the horn once, or did not honk at all.
Book value of the parked car, in dollars.
Month in which the incident occurred.
Day of the week on which the incident occurred.
Time at which the incident occurred, in military (24-hour) format. For example, 1130 denotes 11:30 AM, while 1350 denotes 1:50 PM.
Time in seconds for the parked car to depart the parking spot.
Status of the waiting "confronting" car. The Maxima is considered a low-status car, whereas the Infinity Q45 is an expensive, "high-status" car.
Difference in value between the confronting car and the parked car, in dollars. The values of the confronting cars were as follows: Maxima: 5200, Infinity Q45: 57000.
This is almost the original data. B. Ruback indicates (personal communication) that several observations are missing and cannot be recovered at the present time.
"Territorial Defense in Parking Lots: Retaliation Against Waiting Drivers", B. Ruback and D. Juieng, Journal of Applied Social Psychology, Volume 27, Issue 9, May 1997, pp. 821-834. Provided by B. Ruback.
A study of how long it takes a driver to vacate his/her spot in a parking lot.
A data frame with 237 observations on the following 12 variables.
The type of car that was waiting for the parking spot (or that just drove by). Either a Nissan Maxima, a Lexus or an Infinity Q45. The car is "confronting" the parked car, hence the name of the variable.
Sex of the driver of the parked car.
Race of the driver of the parked car.
Number of people in the parked car (including the driver).
The waiting car either did not intrude on the parked car at all, intruded only slightly by driving by, or stopped near the parking spot and waited. In that case the waiting car either honked the horn once, or did not honk at all.
Book value of the parked car, in dollars.
Month in which the incident occurred.
Day of the week on which the incident occurred.
Time at which the incident occurred, in military (24-hour) format. For example, 1130 denotes 11:30 AM, while 1350 denotes 1:50 PM.
Time in seconds for the parked car to depart the parking spot.
Status of the waiting "confronting" car. The Maxima is considered a low-status car, whereas the Lexus and Infinity Q45 are expensive, "high-status" cars.
Difference in value between the confronting car and the parked car, in dollars. The values of the confronting cars were as follows: Maxima: 5200, Lexus: 43000, Infinity Q45: 57000.
This is almost the original data. B. Ruback indicates (personal communication) that three observations are missing and cannot be recovered at the present time.
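The time variable described above packs hours and minutes into one integer (1130 for 11:30, 1350 for 13:50). A minimal Python sketch of decoding it, using a hypothetical helper name:

```python
from datetime import time


def parse_military_time(value: int) -> time:
    """Split an integer such as 1350 into hours (13) and minutes (50)."""
    hours, minutes = divmod(value, 100)
    # datetime.time raises ValueError for impossible values such as 2575.
    return time(hour=hours, minute=minutes)


print(parse_military_time(1130))
print(parse_military_time(1350))
```

Converting to a proper time object makes it straightforward to compute elapsed minutes or to group incidents by hour.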
"Territorial Defense in Parking Lots: Retaliation Against Waiting Drivers", B. Ruback and D. Juieng, Journal of Applied Social Psychology, Volume 27, Issue 9, May 1997, pp. 821-834. Provided by B. Ruback.
Persons associated with a fictional Zen Center.
In a family with event, donation and eventParticipation.
A data frame with 11 observations on the following 11 variables.
Participant ID
Type of the participant. A factor with levels member and visitor.
First name of participant.
Middle name of participant.
Last name of participant.
Street address of participant.
City of participant.
State of participant.
Postal code of participant.
Email address of participant.
Phone number of participant.
Hypothetical data.
Will people using a public pay-phone talk longer if someone is waiting to use their phone? In this experiment, conducted in 1989, "the investigators measured the length of time (in seconds) that subjects spent on the telephone under one of three conditions: when alone (A), when one person was using an adjacent telephone (B), or when one person was using an adjacent telephone and another person was waiting to use one of the two telephones (C). The study was conducted in an alcove of a shopping mall, an area that contained only the two adjacent telephones." (Quotation from Business Statistics, 6th ed., 1992, by W. Daniel and J. Terrell.)
A data frame with 56 observations on the following 3 variables.
Sex of the subject.
Which condition the subject was put into (A, B or C as described above) by the researchers.
Time in seconds that the subject spent on the phone.
R.B. Ruback, K.D. Poe, and P. Doriat, "Waiting on a Phone: Intrusion on Callers Leads to Territorial Defense," Social Psychology Quarterly, 52:232-241. Gender data provided by R.B. Ruback (personal communication).
Amazon.com reader-reviews of several popular books.
A data frame with 243,269 observations on the following 5 variables.
book
The book under review. Values along with book-titles are as follows:
hunger: "The Hunger Games"
shades: "Fifty Shades of Grey"
fault: "The Fault in Our Stars"
martian: "The Martian"
unbroken: "Unbroken"
gonegirl: "Gone Girl"
traingirl: "The Girl on the Train"
goldfinch: "The Goldfinch"
rating
rating assigned (1-5)
URL_fragment
Prepend "https://www.amazon.com/" to get the full URL of the review.
review_title
Title of the review; usually a concise judgment of the book.
content
HTML of the review text.
This data frame is a compilation of the data sets in "Amazon Book Reviews", in the UC-Irvine Machine Learning Repository. See https://archive.ics.uci.edu/ml/datasets/Amazon+book+reviews for more information.
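The URL_fragment variable is meant to be combined with a fixed prefix. A minimal Python sketch of that step follows. The helper name and the sample fragment are hypothetical, and stripping a leading slash is a defensive assumption rather than something the data is known to need.

```python
BASE_URL = "https://www.amazon.com/"


def full_review_url(url_fragment: str) -> str:
    """Prepend the Amazon base URL, avoiding a doubled slash."""
    return BASE_URL + url_fragment.lstrip("/")


# Made-up fragment for illustration only.
print(full_review_url("example/review/path"))
```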
Subset of data from a study on edibility of mushrooms. The individual mushrooms come from 23 species of gilled mushrooms in the Agaricus and Lepiota Family. The aim is to come up with a rule for predicting, on the basis of an individual mushroom's characteristics, whether or not the mushroom is edible. Remaining data is held back for evaluation of proposed rules.
A data frame with 5891 observations on the following 23 variables.
Whether the mushroom is edible or poisonous.
Whether or not the mushroom is bruised.
A sample of mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981), G. H. Lincoff (Pres.), New York: Alfred A. Knopf. Original data contributed by Jeffrey Schlimmer to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml), Irvine, CA: University of California, School of Information and Computer Science. See http://archive.ics.uci.edu/ml/datasets/Mushroom.
28 NFL teams from the 1980s. Team uniforms were rated for their "malevolence", and the average penalty yardage for each team was also recorded.
A data frame with 28 observations on the following 3 variables.
Name of NFL team.
Rating of "malevolence" accorded to the team uniform. High scores indicate more malevolence.
Mean penalty yardage per game for the team, expressed as a z-score.
The Dark Side of Self- and Social Perception: Black Uniforms and Aggression in Professional Sports, Frank and Gilovich, Journal of Personality and Social Psychology, 1988, Vol. 54, No. 1, 74-85.
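Penalty yardage in this dataset is expressed as a z-score. For readers unfamiliar with the transformation, here is a minimal Python sketch. It assumes the sample standard deviation as the divisor, which may differ from what the original authors used, and the input values are invented.

```python
from statistics import mean, stdev


def z_scores(values):
    """Center each value on the mean and scale by the sample standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample SD (n - 1 divisor); an assumption here
    if spread == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - center) / spread for v in values]


# Invented yardage figures, not taken from the dataset.
print(z_scores([40.0, 50.0, 60.0]))
```

A positive z-score marks a team penalized more than average, a negative one a team penalized less.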
Weather data collected at the Macleish Field Station in Whately, MA during 2015. This is a copy of the whately_2015 data from package 'macleish': <https://github.com/beanumber/macleish>.
For both, a data frame ([dplyr::tbl_df()]) with roughly 52,560 rows and 8 or 9 variables.
The following variables are values that are found in either the 'whately_2015' or 'orchard_2015' data tables.
All variables are averaged over the 10 minute interval unless otherwise noted.
Timestamp for each measurement set in Eastern Standard Time.
average temperature, in Celsius
Wind speed, in meters per second
Wind direction, in degrees
How much water there is in the air, in millimeters
Atmospheric pressure, in millibars
Total rainfall, in millimeters
Amount of radiation coming from the sun, in Watts/meters^2. Solar measurement for Whately
Photosynthetically Active Radiation (sunlight between 400 and 700 nm), in average density of Watts/meters^2. One of two solar measurements for Orchard
Photosynthetically Active Radiation (sunlight between 400 and 700 nm), in average total over measurement period of Watts/meters^2. One of two solar measurements for Orchard
The Macleish Field Station is a remote outpost owned by Smith College and used for field research. There are two weather stations on the premises. One is called 'WhatelyMet' and the other is 'OrchardMet'.
The 'WhatelyMet' station is located at (42.448470, -72.680553) and the 'OrchardMet' station is at (42.449653, -72.680315).
'WhatelyMet' is located at the end of Poplar Hill Road in Whately, Massachusetts, USA. The meteorological instruments of 'WhatelyMet' (except the rain gauge) are mounted at the top of a tower 25.3 m tall, well above the surrounding forest canopy. The tower is located on a local ridge at an elevation 250.75m above sea level.
'OrchardMet' is located about 250 m north of the first tower in an open field next to an apple orchard. Full canopy trees (~20 m tall) are within 30 m of this station. This station has a standard instrument configuration with temperature, relative humidity, solar radiation, and barometric pressure measured between 1.5 and 2.0 m above the ground. Wind speed and direction are measured on a 10 m tall tower and precipitation is measured on the ground. Ground temperature is measured at 15 and 30 cm below the ground surface 2 m south of the tower. The tower is located 258.1 m above sea level. Data collection at OrchardMet began on June 27th, 2014.
The variables shown above are weather data collected at 'WhatelyMet' and 'OrchardMet' during 2015. Solar radiation is measured in two different ways: see 'SlrW_Avg' or the 'PAR' variables for Photosynthetically Active Radiation.
Note that a loose wire resulted in erroneous temperature readings at OrchardMet in late November, 2015.
These data are recorded at <https://www.smith.edu/about-smith/sustainable-smith/ceeds>
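The figure of roughly 52,560 rows follows from one record every 10 minutes over a full non-leap year. A short Python sketch of that arithmetic, with a hypothetical function name:

```python
import calendar

MINUTES_PER_DAY = 24 * 60


def expected_rows(year: int, interval_minutes: int = 10) -> int:
    """Number of fixed-interval records in a complete calendar year."""
    days = 366 if calendar.isleap(year) else 365
    return days * MINUTES_PER_DAY // interval_minutes


print(expected_rows(2015))
```

Comparing this count with the actual number of rows is a quick way to spot gaps in the record.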