Cleaning the Medallions
March 4, 2018
There are 22 variables for each Boro taxi trip and 18 for each Medallion trip. Based on what we intend to visualize and analyze, we can restrict our scope to the following fields.
Variable | Description |
---|---|
dropoff_datetime | Date and time when meter was disengaged |
pickup_datetime | Date and time when meter was engaged |
pickup_longitude | Longitude where meter was engaged |
pickup_latitude | Latitude where meter was engaged |
dropoff_longitude | Longitude where meter was disengaged |
dropoff_latitude | Latitude where meter was disengaged |
passenger_count | Number of passengers in the vehicle (entered by the driver) |
trip_distance | Elapsed trip distance in miles, reported by the taximeter |
fare_amount | Time-and-distance fare calculated by the meter |
tip_amount | Tip amount - automatically populated for credit card tips, cash tips are not included |
payment_type | A numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip |
Cleaning the Medallions
From trips with negative fares to ones that start in the Atlantic Ocean, it is clear that we have some data wrangling to do before we can get down to the analysis.
Sanity Checks
We start with overall row counts, then aggregate the trips by month as a rough check that the dataset has no glaring issues.
I am happy to report that the trip counts across months for Boro and Medallion trips seem reasonable.
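The monthly check can be sketched as below. This is a minimal illustration, not the code used for the post: the sample rows, the year, the timestamp format and the helper name `trips_per_month` are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; the real data has one record per trip.
trips = [
    {"pickup_datetime": "2015-01-03 08:15:00"},
    {"pickup_datetime": "2015-01-19 17:40:00"},
    {"pickup_datetime": "2015-02-07 23:05:00"},
]

def trips_per_month(rows):
    """Count trips per (year, month) of the pickup timestamp."""
    counts = Counter()
    for row in rows:
        # Assumed timestamp format: YYYY-MM-DD HH:MM:SS
        picked_up = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        counts[(picked_up.year, picked_up.month)] += 1
    return counts

monthly = trips_per_month(trips)
for (year, month), n in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {n}")
```

A month with a count far below its neighbours would be the kind of glaring issue this check is meant to surface.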
Pickup/Dropoff Locations
According to Netstate, New York spans longitudes from -79.4554 to -71.4725 and latitudes from 40.2940 to 45.0042.
We found 2.5 million Medallion trips and 47 thousand Boro trips that did not meet these criteria. It appears that the system defaults longitude and latitude to 0 when they are unavailable. There are also instances where the logged location does not fall within New York City.
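A bounding-box check along these lines might look like the sketch below. It uses the bounds quoted above; applying the test to both the pickup and the dropoff point is an assumption, and the sample trips and helper names are hypothetical.

```python
# Bounds quoted above for New York.
LON_MIN, LON_MAX = -79.4554, -71.4725
LAT_MIN, LAT_MAX = 40.2940, 45.0042

def in_bounds(lon, lat):
    """True if the point lies inside the bounding box."""
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX

def has_valid_locations(trip):
    """Assumption: a trip is kept only if both endpoints are in bounds."""
    return (
        in_bounds(trip["pickup_longitude"], trip["pickup_latitude"])
        and in_bounds(trip["dropoff_longitude"], trip["dropoff_latitude"])
    )

# Hypothetical trips: one within the box, one with a (0, 0) pickup.
sample_trips = [
    {"pickup_longitude": -73.9857, "pickup_latitude": 40.7484,
     "dropoff_longitude": -73.9772, "dropoff_latitude": 40.7527},
    {"pickup_longitude": 0.0, "pickup_latitude": 0.0,
     "dropoff_longitude": -73.9772, "dropoff_latitude": 40.7527},
]

located_trips = [t for t in sample_trips if has_valid_locations(t)]
```

The (0, 0) default falls outside the box, so the same test catches both missing coordinates and locations logged far from New York.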
Passenger Count
This field is entered by the driver. Based on the frequency of each value, it seems safe to treat the values 0, 7, 8 and 9 as erroneous.
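The frequency check and the resulting filter can be illustrated roughly as follows. The list of counts is made up for the example, and `INVALID_PASSENGER_COUNTS` is a hypothetical name.

```python
from collections import Counter

# Values judged erroneous in the text above.
INVALID_PASSENGER_COUNTS = {0, 7, 8, 9}

# Hypothetical passenger_count values, one per trip.
passenger_counts = [1, 1, 2, 0, 5, 1, 7, 3, 9]

# Frequency of each value, to see which ones look implausible.
frequencies = Counter(passenger_counts)
for value, n in sorted(frequencies.items()):
    print(value, n)

# Keep only the values that were not flagged.
plausible_counts = [n for n in passenger_counts if n not in INVALID_PASSENGER_COUNTS]
```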
Trip Distance
Thankfully, we did not find any trips where the logged distance was negative. However, we did find a significant number that logged 0 miles. These rows also seem to have issues with other fields, so we will remove them.
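As a rough sketch, the distance check amounts to counting negative and zero values and keeping the rest. The distances below are invented for illustration.

```python
# Hypothetical trip_distance values in miles.
distances = [2.3, 0.0, 0.8, 0.0, 11.4]

negative_count = sum(1 for d in distances if d < 0)
zero_count = sum(1 for d in distances if d == 0)

# Drop trips that logged 0 miles (and any negative ones, had they existed).
positive_distances = [d for d in distances if d > 0]

print(f"negative: {negative_count}, zero: {zero_count}, kept: {len(positive_distances)}")
```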
Fare
According to the NYC Taxi & Limousine Commission, the initial charge for a taxi ride is USD 2.50. We should therefore reject any rows with recorded fares below this amount. This will remove 121,297 Medallion trips and 95,211 Boro trips.
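The fare rule is a simple threshold, sketched here with made-up fares. `MIN_FARE` is a hypothetical name for the USD 2.50 initial charge mentioned above.

```python
# Initial charge in USD; fares below it are treated as invalid.
MIN_FARE = 2.50

# Hypothetical fare_amount values.
fares = [-3.0, 0.0, 2.5, 2.0, 14.5]

rejected_fares = [f for f in fares if f < MIN_FARE]
accepted_fares = [f for f in fares if f >= MIN_FARE]

print(f"rejected {len(rejected_fares)} of {len(fares)} trips")
```

A fare of exactly 2.50 is kept, since the rule only rejects fares below the initial charge.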
Tip
Interestingly, we note several instances where the tip is negative. We will be removing these records.
Payment Type
We are only interested in trips paid by credit card or cash, as the other payment types suggest that there are issues with the trip.
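The tip and payment-type rules can be combined into one predicate, as in the sketch below. It assumes `payment_type` has been parsed as an integer using the codes from the table above. The helper name and the sample rows are hypothetical.

```python
# Codes from the field table: 1 = Credit card, 2 = Cash.
VALID_PAYMENT_TYPES = {1, 2}

def passes_tip_and_payment_checks(trip):
    """Keep trips with a non-negative tip that were paid by card or cash."""
    return trip["tip_amount"] >= 0 and trip["payment_type"] in VALID_PAYMENT_TYPES

# Hypothetical rows covering each case.
payment_samples = [
    {"tip_amount": 2.0, "payment_type": 1},   # credit card, kept
    {"tip_amount": 0.0, "payment_type": 2},   # cash, kept
    {"tip_amount": -1.5, "payment_type": 1},  # negative tip, removed
    {"tip_amount": 0.0, "payment_type": 4},   # dispute, removed
]

paid_trips = [t for t in payment_samples if passes_tip_and_payment_checks(t)]
```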