Chapter 3 Data

3.1 Sources

After refining our project goals, we rely mainly on the following two data sources.

3.1.1 DOF Condominium Comparable Rental Income in NYC

url: https://data.cityofnewyork.us/City-Government/DOF-Condominium-Comparable-Rental-Income-in-NYC/9ck6-2jew

The Department of Finance (DOF) uses this dataset to value condominiums. It contains basic information on condominiums for rent in New York City, including geographic location (coded by borough ID and priority ID) and property characteristics. Because DOF collects the data itself by surveying apartment information, we consider it highly reliable and largely error free. Because it includes detailed address and property information, the dataset is also convenient for analyzing the geographical distribution of rental prices. It contains 61 columns, of which we use only a few. The columns we use are:

Boro-Block-Lot: Text variable. The Borough-Block-Lot location of the subject condominium. The lot identifies the condominium billing lot generally associated with the condominium management organization.

Building Classification: Text variable (we transform it later). The building class code describes a property’s use. This report includes the two-character code as well as the description of the building class.

Total Units: Number variable. Total number of units in the building.

Year Built: Text variable. The year the building was built.

Estimated Gross Income: Number variable. Estimated income per square foot multiplied by gross square feet.

Gross Income per SqFt: Number variable. Estimated income per square foot of the median comparable.

Full Market Value: Number variable. Current year’s total market value of the land and building.
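The relationship between the income columns described above can be illustrated with a minimal sketch. The column names follow the dotted names R gives these columns on import; the two rows are made-up values for illustration, not records from the DOF dataset.

```r
# Hypothetical rows; the numbers are invented for illustration only.
condo <- data.frame(
  Gross.SqFt            = c(10000, 25000),
  Gross.Income.per.SqFt = c(40.5, 32.0)
)

# Estimated gross income = income per square foot * gross square feet.
condo$Estimated.Gross.Income <- condo$Gross.Income.per.SqFt * condo$Gross.SqFt

print(condo)
```

The same product could be compared against the published Estimated Gross Income column as a quick consistency check.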

3.1.2 Housing: United States Census Bureau

This data source contains housing information for New York State provided by the United States Census Bureau (USCB). The USCB is a principal agency of the U.S. Federal Statistical System and is in charge of gathering information on the population and economy of the country. We filter the tables by geography and select the topics we are interested in. Each table can be downloaded as a CSV or Excel file, which is easy to import into R. None of the tables from this source is tidy, and the values are stored as character strings even though they carry numerical information. We transform them later.

3.1.2.1 Gross Rent

url: https://data.census.gov/table?q=B250&g=0600000US3600508510,3604710022,3606144919,3608160323,3608570915_1600000US3651000&tid=ACSDT1Y2021.B25063

This table contains 1-Year Estimates of the rent of the occupied housing units in five boroughs in New York City.

Size: 27 rows by 12 columns

Columns: Estimates and Margin of Error for five boroughs

Row names: The range of rent (e.g. ‘$700 to $749’)

3.1.2.2 Median Gross Rent

url: https://data.census.gov/table?q=B250&g=0600000US3600508510,3604710022,3606144919,3608160323,3608570915_1600000US3651000&tid=ACSDT1Y2021.B25064

url: https://data.census.gov/table?q=B2506&g=0400000US36$8600000&tid=ACSDT5Y2021.B25064

The first table contains 1-Year Estimates of the median gross rent of the occupied housing units in five boroughs in New York City in different years.

Size: 6 rows by 12 columns

Columns: Estimates and Margin of Error for five boroughs

Row names: Year

The second table contains 5-Year Estimates of the median gross rent of the occupied housing units, broken down by ZIP code in New York State.

Size: 1 row by 3652 columns

Columns: Estimates and Margin of Error for different ZIP codes

3.1.2.3 Vacancy Status

url: https://data.census.gov/table?q=B250&g=0600000US3600508510,3604710022,3606144919,3608160323,3608570915_1600000US3651000

This table contains 1-Year Estimates of the number of vacant housing units in five boroughs in New York City.

Size: 8 rows by 12 columns

Columns: Estimates and Margin of Error for five boroughs

Row names: Vacancy status (e.g. ‘Rented, not occupied’)

3.1.2.4 Median Gross Rent by Year Householder Moved into Unit

url: https://data.census.gov/table?q=B2511&g=0600000US3600508510,3604710022,3606144919,3608160323,3608570915_1600000US3651000&tid=ACSDT1Y2021.B25113

This table contains 1-Year Estimates of the median gross rent by the year the householder moved into the unit, for five boroughs in New York City.

Size: 7 rows by 12 columns

Columns: Estimates and Margin of Error for five boroughs

Row names: Year householder moved in (e.g. ‘Moved in 2019 or later’)

3.2 Cleaning / transformation

Because both datasets are collected and well organized by government offices, we did not need to correct erroneous values in them. The work in this section is limited to the transformations described in the subsections that follow.

3.2.1 DOF Condominium Comparable Rental Income in NYC

We transform some columns (such as Building Classification and Boro-Block-Lot) into formats that are easier to work with for specific visualization tasks. The code is attached in the Results part.
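As an illustration of this kind of transformation, the sketch below splits a Boro-Block-Lot string and shortens the building class. It is not the authors’ code: the sample values, the hyphen separator, the `borough_id` helper column, and the rule used to derive `house_type` are all assumptions made for the example. The borough codes 1 to 5 follow the standard New York City numbering (Manhattan, Bronx, Brooklyn, Queens, Staten Island).

```r
# Hypothetical sample values; the separator and code layout are assumptions.
condo <- data.frame(
  Boro.Block.Lot          = c("1-00007-7501", "3-01234-7502", "5-00456-7503"),
  Building.Classification = c("R4-ELEVATOR", "R2-WALK-UP", "R9-CONDOPS"),
  stringsAsFactors = FALSE
)

# Split "borough-block-lot" and keep the leading borough code.
parts <- strsplit(condo$Boro.Block.Lot, "-", fixed = TRUE)
condo$borough_id <- as.integer(vapply(parts, function(p) p[1], character(1)))

# Borough codes 1-5 correspond to these counties, in this order.
county_by_id <- c("new york", "bronx", "kings", "queens", "richmond")
condo$county <- county_by_id[condo$borough_id]

# Assumed rule: use the two-character class code as a short label.
condo$house_type <- substr(condo$Building.Classification, 1, 2)

print(condo[, c("Boro.Block.Lot", "county", "house_type")])
```

Keeping the county as a separate column makes it straightforward to group or facet plots by borough.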

3.2.2 Housing: United States Census Bureau

The transformation of the USCB data concentrates on two tasks: tidying the data and converting types. The main problem with these tables is that the geographical labels are spread across the columns rather than stored as values of a variable, so we use the function ‘pivot_longer’ to reshape them. We then use the function ‘gsub’ to clean the values and convert them from character to numeric.
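A minimal sketch of these two steps is shown here, assuming a small wide table with one column per borough. The table contents and the names `rent_wide`, `label`, `borough`, and `median_rent` are hypothetical; the real tables have more columns, including margins of error.

```r
library(tidyr)

# Hypothetical wide table; labels and values are made up for illustration.
rent_wide <- data.frame(
  label  = c("Moved in 2019 or later", "Moved in 2015 to 2018"),
  Bronx  = c("1,450", "1,300"),
  Queens = c("2,100+", "1,800"),
  stringsAsFactors = FALSE
)

# Step 1: move the borough columns into a single `borough` variable.
rent_long <- pivot_longer(
  rent_wide,
  cols      = -label,
  names_to  = "borough",
  values_to = "median_rent"
)

# Step 2: drop everything except digits and the decimal point, then convert.
rent_long$median_rent <- as.numeric(gsub("[^0-9.]", "", rent_long$median_rent))

print(rent_long)
```

After reshaping, each row holds one label, one borough, and one numeric value, which is the form most plotting functions expect.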

3.3 Missing value analysis

3.3.1 DOF Condominium Comparable Rental Income in NYC

##              Year.Built   Market.Value.per.SqFt 
##                     108                       3 
##       Full.Market.Value       Estimated.Expense 
##                       2                       1 
##        Expense.per.SqFt          Boro.Block.Lot 
##                       1                       0 
##           Condo.Section                 Address 
##                       0                       0 
##            Neighborhood Building.Classification 
##                       0                       0 
##             Total.Units              Gross.SqFt 
##                       0                       0 
##  Estimated.Gross.Income   Gross.Income.per.SqFt 
##                       0                       0 
##    Net.Operating.Income                  county 
##                       0                       0 
##              house_type 
##                       0

The counts of missing values per column show that most NAs (108 of 115) fall in Year.Built, the year the building was built. Because there are many variables, we focus our analysis on those with at least one missing value.
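Per-column counts like those listed above could be produced along the following lines. This is a sketch in base R on a made-up data frame, not necessarily the call the authors used.

```r
# Hypothetical data frame with a few missing values.
condo <- data.frame(
  Year.Built        = c(1990, NA, NA, 2005),
  Full.Market.Value = c(1e6, 2e6, NA, 3e6),
  Total.Units       = c(10, 20, 30, 40)
)

# Count NAs in every column and sort from most to fewest.
na_counts <- sort(colSums(is.na(condo)), decreasing = TRUE)
print(na_counts)

# Keep only the columns that have at least one missing value.
print(na_counts[na_counts > 0])
```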

##     county cnt.na     n      percent
## 1    bronx      1  1042 0.0009596929
## 2    kings     97  9314 0.0104144299
## 3 new york      2 12784 0.0001564456
## 4   queens      2  5022 0.0003982477
## 5 richmond      6   345 0.0173913043

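Because the dataset has more than 20,000 rows and no column has more than 108 missing values, most rows are complete cases. Among the variables we care about, Year.Built is the only one with more than a handful of missing values. We therefore examine how its missing values are distributed across the county each building belongs to. In the table above, cnt.na is the number of missing Year.Built values, n is the number of rows, and percent is their ratio, expressed as a proportion. Richmond County (the borough of Staten Island) has the largest share of missing values (about 1.7%) and also the fewest buildings in our analysis, while Kings County accounts for most of the missing values in absolute terms (97 of 108).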
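The per-county summary can be sketched in base R as follows. The data frame is a small made-up example, and the output columns are named after those in the table above (`cnt.na`, `n`, `percent`); the authors’ actual code may differ.

```r
# Hypothetical rows: county plus a Year.Built column with some NAs.
condo <- data.frame(
  county     = c("kings", "kings", "kings", "queens", "queens"),
  Year.Built = c(NA, 1990, 2001, 1985, NA),
  stringsAsFactors = FALSE
)

# Missing values and total rows per county (groups come out alphabetically).
cnt_na <- tapply(is.na(condo$Year.Built), condo$county, sum)
n      <- tapply(condo$Year.Built, condo$county, length)

by_county <- data.frame(
  county  = names(n),
  cnt.na  = as.vector(cnt_na),
  n       = as.vector(n),
  percent = as.vector(cnt_na / n)  # a proportion between 0 and 1
)

print(by_county)
```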

3.3.2 Housing: United States Census Bureau

According to the counts of missing values for each table, none of them has missing values in any column. Taking the table of median gross rent by year householder moved into unit as an example, we can draw a missing data plot. However, because the USCB variables are stored as character strings, missing values may be encoded as text rather than as NA, so further analysis may be needed to confirm that none are hidden.
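One possible form of that further analysis is sketched below. It assumes that a cell with no digits at all (for example a lone ‘-’) is a placeholder for a missing estimate; the sample values and the names `rent`, `estimate`, and `estimate_num` are hypothetical.

```r
# Hypothetical character column as it might arrive from a downloaded table.
rent <- data.frame(
  label    = c("Total", "Moved in 2019 or later", "Moved in 2015 to 2018"),
  estimate = c("1,602", "-", "2,100+"),
  stringsAsFactors = FALSE
)

# A string with no digit at all cannot carry a number: treat it as missing.
is_placeholder <- !grepl("[0-9]", rent$estimate)

# Strip formatting characters, convert, and mark placeholders explicitly.
rent$estimate_num <- as.numeric(gsub("[^0-9.]", "", rent$estimate))
rent$estimate_num[is_placeholder] <- NA

# Number of hidden missing values revealed by the conversion.
print(sum(is.na(rent$estimate_num)))
```

Running the usual NA counts after this conversion would show whether the character columns were hiding any missing estimates.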