Cleaning the Medallions

March 4, 2018

Each Boro taxi trip has 22 variables, and each Medallion trip has 18. Based on what we intend to visualize and analyze, we can restrict our scope to the following fields.

Variable Description
dropoff_datetime Date and time when meter was disengaged
pickup_datetime Date and time when meter was engaged
pickup_longitude Longitude where meter was engaged
pickup_latitude Latitude where meter was engaged
dropoff_longitude Longitude where meter was disengaged
dropoff_latitude Latitude where meter was disengaged
passenger_count Number of passengers in the vehicle (entered by the driver)
trip_distance Elapsed trip distance in miles, reported by the taximeter
fare_amount Time-and-distance fare calculated by the meter
tip_amount Tip amount, populated automatically for credit card tips (cash tips are not included)
payment_type A numeric code signifying how the passenger paid for the trip

1 = Credit Card
2 = Cash
3 = No charge
4 = Dispute
5 = Unknown
6 = Voided trip

Cleaning the Medallions

From trips with negative fares to ones that start in the Atlantic Ocean, it is clear that we have some data wrangling to do before we can get down to the analysis.

Sanity Checks

We start with overall counts, then aggregate the trips by month as a rough check that the dataset has no glaring issues.
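As a rough illustration of the monthly check, here is a minimal Python sketch. It assumes the trips are available as dictionaries keyed by the field names above and that pickup_datetime is a string in "YYYY-MM-DD HH:MM:SS" format. Both are assumptions for the example, and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; adjust if the raw files use another format.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def trips_per_month(rows):
    """Count trips per (year, month), based on pickup_datetime."""
    counts = Counter()
    for row in rows:
        picked_up = datetime.strptime(row["pickup_datetime"], TIMESTAMP_FORMAT)
        counts[(picked_up.year, picked_up.month)] += 1
    # Sort by (year, month) so gaps or spikes are easy to spot.
    return dict(sorted(counts.items()))

# Hypothetical rows, for illustration only.
sample_trips = [
    {"pickup_datetime": "2015-01-03 08:15:00"},
    {"pickup_datetime": "2015-01-20 23:59:59"},
    {"pickup_datetime": "2015-02-01 00:00:00"},
]
monthly_counts = trips_per_month(sample_trips)
```

A month with a count far above or below its neighbours would be the kind of glaring issue this check is meant to surface.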

I am happy to report that the trip counts across months for Boro and Medallion trips seem reasonable.

Pickup/Dropoff Locations

According to Netstate, New York spans longitudes from -79.4554 to -71.4725 and latitudes from 40.2940 to 45.0042.

We found 2.5 million Medallion trips and 47 thousand Boro trips that did not meet these criteria. It appears that the system defaults longitude and latitude to 0 when they are unavailable. There are also instances where the logged location does not fall within New York City.
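The bounding-box test can be sketched in a few lines of Python. The bounds are the ones quoted above. Checking both the pickup and the dropoff point is an assumption of this sketch, as are the function names and the sample rows.

```python
# Bounds as quoted in the text (decimal degrees).
LON_MIN, LON_MAX = -79.4554, -71.4725
LAT_MIN, LAT_MAX = 40.2940, 45.0042

def in_bounds(longitude, latitude):
    """True if the point lies inside the bounding box."""
    return LON_MIN <= longitude <= LON_MAX and LAT_MIN <= latitude <= LAT_MAX

def has_valid_locations(row):
    """Assumption: a trip is kept only if both endpoints are in bounds."""
    return (
        in_bounds(row["pickup_longitude"], row["pickup_latitude"])
        and in_bounds(row["dropoff_longitude"], row["dropoff_latitude"])
    )

# Hypothetical rows: one in Manhattan, one with a zeroed-out pickup.
good_trip = {
    "pickup_longitude": -73.9855, "pickup_latitude": 40.7580,
    "dropoff_longitude": -73.9772, "dropoff_latitude": 40.7527,
}
zeroed_trip = dict(good_trip, pickup_longitude=0.0, pickup_latitude=0.0)
```

Because (0, 0) falls outside the box, rows where the coordinates defaulted to 0 are rejected by the same test as genuinely out-of-range locations.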

Passenger Count

This field is entered by the driver. Based on the frequency of each value, I think it is safe to conclude that the values 0, 7, 8 and 9 are erroneous.
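A small Python sketch of this step: tabulate how often each passenger_count value occurs, then drop the values judged erroneous. The helper names and the sample data are hypothetical; only the set {0, 7, 8, 9} comes from the text.

```python
from collections import Counter

# Values treated as erroneous, per the frequency inspection described above.
INVALID_PASSENGER_COUNTS = {0, 7, 8, 9}

def passenger_count_frequencies(rows):
    """Tabulate how many trips report each passenger_count value."""
    return Counter(row["passenger_count"] for row in rows)

def has_plausible_passenger_count(row):
    return row["passenger_count"] not in INVALID_PASSENGER_COUNTS

# Hypothetical rows, for illustration only.
sample_trips = [{"passenger_count": n} for n in (1, 1, 2, 0, 6, 9)]
frequencies = passenger_count_frequencies(sample_trips)
kept_trips = [row for row in sample_trips if has_plausible_passenger_count(row)]
```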

Trip Distance

Thankfully, we did not find any trips where the logged distance was negative. However, we did find a significant number that logged 0 miles. These rows also seem to have issues with other fields, so we will remove them.

Fare

According to the NYC Taxi & Limousine Commission, the initial charge for a taxi ride is USD 2.50, so we reject any rows with a recorded fare below this amount. This removes 121,297 Medallion trips and 95,211 Boro trips.

Tip

Interestingly, we note several instances where the tip is negative. We will be removing these records.

Payment Type

We are only interested in trips paid by credit card or cash, as the other payment types suggest that there are issues with the trip.
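The distance, fare, tip and payment-type rules can be combined into a single per-row check. The Python sketch below is one way to do that, not necessarily how the cleaning was actually run. The function name and sample rows are hypothetical, and returning the list of failed fields (rather than a plain boolean) is a design choice that makes it easy to count how many rows each rule removes.

```python
MIN_FARE = 2.50                # initial fare cited in the text (USD)
VALID_PAYMENT_TYPES = {1, 2}   # 1 = Credit Card, 2 = Cash

def rejection_reasons(row):
    """Return the names of the fields that fail a rule (empty list = keep)."""
    reasons = []
    if row["trip_distance"] <= 0:
        reasons.append("trip_distance")
    if row["fare_amount"] < MIN_FARE:
        reasons.append("fare_amount")
    if row["tip_amount"] < 0:
        reasons.append("tip_amount")
    if row["payment_type"] not in VALID_PAYMENT_TYPES:
        reasons.append("payment_type")
    return reasons

# Hypothetical rows, for illustration only.
clean_trip = {"trip_distance": 1.2, "fare_amount": 7.5,
              "tip_amount": 1.5, "payment_type": 1}
bad_trip = {"trip_distance": 0.0, "fare_amount": -3.0,
            "tip_amount": -1.0, "payment_type": 4}

kept_trips = [t for t in (clean_trip, bad_trip) if not rejection_reasons(t)]
```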
