[ECC DS Week 6] New York City Taxi Trip Duration_2. EDA + Baseline Model

2023-05-08

0. Importing libraries & modules

### ML-related
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

### scikit-learn
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer # had to be switched to SimpleImputer because of the sklearn version
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

### Date and time
from datetime import datetime
import calendar
import matplotlib.dates as mdates
import matplotlib as mpl
from datetime import timedelta
import datetime as dt

### Math
from math import sin, cos, sqrt, atan2, radians

### Map visualization
import folium
from folium import FeatureGroup, LayerControl, Map, Marker
from folium.plugins import HeatMap

### File I/O
import pickle

### Warnings
import warnings
warnings.filterwarnings('ignore')

### Options
pd.set_option('display.max_colwidth', None) # -1 is deprecated in recent pandas; None disables truncation
plt.style.use('fivethirtyeight')

1. Preparing the data

Goal: build a model that predicts the total ride duration of taxi trips in New York City.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

### Load the data

train = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/9주차/data/train.csv")
test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/9주차/data/test.csv")

### Check the data shape

print('train shape : ', train.shape, 'test shape : ', test.shape)

train shape :  (1458644, 11) test shape :  (625134, 9)

### Preview the data

train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	2124
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	429
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	435

test.head()

	id	vendor_id	pickup_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id3004672	1	2016-06-30 23:59:58	1	-73.988129	40.732029	-73.990173	40.756680	N
1	id3505355	1	2016-06-30 23:59:53	1	-73.964203	40.679993	-73.959808	40.655403	N
2	id1217141	1	2016-06-30 23:59:47	1	-73.997437	40.737583	-73.986160	40.729523	N
3	id2150126	2	2016-06-30 23:59:41	1	-73.956070	40.771900	-73.986427	40.730469	N
4	id1598245	1	2016-06-30 23:59:33	1	-73.970215	40.761475	-73.961510	40.755890	N

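1-1. Data Description

id: a unique identifier for each trip
vendor_id: a code indicating the provider associated with the trip record
pickup_datetime: date and time when the meter was engaged
dropoff_datetime: date and time when the meter was disengaged
passenger_count: number of passengers in the vehicle (driver-entered value)
pickup_longitude: longitude where the meter was engaged
pickup_latitude: latitude where the meter was engaged
dropoff_longitude: longitude where the meter was disengaged
dropoff_latitude: latitude where the meter was disengaged
store_and_fwd_flag: indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server
- Y: store and forward trip / N: not a store and forward trip
trip_duration: duration of the trip in seconds

1-2. Converting to appropriate data types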

train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'],format = '%Y-%m-%d %H:%M:%S')
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'],format = '%Y-%m-%d %H:%M:%S')
train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	2124
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	429
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	435

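The columns were converted correctly.

2. Data cleaning

2-1. Checking for missing values

# Caution: train[pd.isnull(train)] turns every non-null cell into NaN before summing,
# so it prints 0 for each column regardless; train.isnull().sum() counts missing values per column.
train[pd.isnull(train)].sum()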

id                    0  
vendor_id             0.0
passenger_count       0.0
pickup_longitude      0.0
pickup_latitude       0.0
dropoff_longitude     0.0
dropoff_latitude      0.0
store_and_fwd_flag    0  
trip_duration         0.0
dtype: object

According to this output, there are no missing values.
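A more direct way to count missing values per column is `DataFrame.isnull().sum()`. The sketch below uses a small made-up frame (the name `demo` and its values are illustrative, not taken from the dataset) to show what that call returns when some values really are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
demo = pd.DataFrame({
    "passenger_count": [1, 2, np.nan],
    "store_and_fwd_flag": ["N", None, "N"],
    "trip_duration": [455, 663, 2124],
})

# Count missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# True only when the whole frame contains no missing value
has_no_missing = not demo.isnull().values.any()
print(has_no_missing)
```

Applied to the notebook's data, the same call would be `train.isnull().sum()`.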

2-2. Period covered by the data

print("Min pickup time:",min(train['pickup_datetime']))
print("Max pickup time:",max(train['pickup_datetime']))

Min pickup time: 2016-01-01 00:00:17
Max pickup time: 2016-06-30 23:59:39

The data was recorded from 2016/01/01 to 2016/06/30.

2-3. Data processing

a) Creating date, day, hour, and day-of-week features from datetime

train['pickup_date'] = train['pickup_datetime'].dt.date
train['pickup_day'] = train['pickup_datetime'].dt.day
train['pickup_hour'] = train['pickup_datetime'].dt.hour
train['pickup_day_of_week'] = train['pickup_datetime'].dt.day_name()

train['dropoff_date'] = train['dropoff_datetime'].dt.date
train['dropoff_day'] = train['dropoff_datetime'].dt.day
train['dropoff_hour'] = train['dropoff_datetime'].dt.hour
train['dropoff_day_of_week'] = train['dropoff_datetime'].dt.day_name()

train.head(3)

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration	pickup_date	pickup_day	pickup_hour	pickup_day_of_week	dropoff_date	dropoff_day	dropoff_hour	dropoff_day_of_week
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	455	2016-03-14	14	17	Monday	2016-03-14	14	17	Monday
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	663	2016-06-12	12	0	Sunday	2016-06-12	12	0	Sunday
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	2124	2016-01-19	19	11	Tuesday	2016-01-19	19	12	Tuesday

b) Rounding latitude/longitude

# Round the latitude/longitude columns to 3 decimal places

train['pickup_latitude_round3'] = train['pickup_latitude'].round(3)
train['pickup_longitude_round3'] = train['pickup_longitude'].round(3)
train['dropoff_latitude_round3'] = train['dropoff_latitude'].round(3)
train['dropoff_longitude_round3'] = train['dropoff_longitude'].round(3)

train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	...	pickup_hour	pickup_day_of_week	dropoff_date	dropoff_day	dropoff_hour	dropoff_day_of_week	pickup_latitude_round3	pickup_longitude_round3	dropoff_latitude_round3	dropoff_longitude_round3
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	...	17	Monday	2016-03-14	14	17	Monday	40.768	-73.982	40.766	-73.965
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	...	0	Sunday	2016-06-12	12	0	Sunday	40.739	-73.980	40.731	-73.999
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	...	11	Tuesday	2016-01-19	19	12	Tuesday	40.764	-73.979	40.710	-74.005
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	...	19	Wednesday	2016-04-06	6	19	Wednesday	40.720	-74.010	40.707	-74.012
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	...	13	Saturday	2016-03-26	26	13	Saturday	40.793	-73.973	40.783	-73.973

5 rows × 23 columns

c) Computing distance in km from latitude and longitude

### Function definition
# Takes a row so that it can be passed to DataFrame.apply and update the data easily

def calculateDistance(row):
    """Haversine (great-circle) distance in km between pickup and dropoff."""
    R = 6373.0 # approximate radius of the Earth in km (constant)
    
    # degrees -> radians
    pickup_lat = radians(row['pickup_latitude'])
    pickup_lon = radians(row['pickup_longitude'])
    dropoff_lat = radians(row['dropoff_latitude'])
    dropoff_lon = radians(row['dropoff_longitude'])

    dlon = dropoff_lon - pickup_lon # longitude difference
    dlat = dropoff_lat - pickup_lat # latitude difference

    ### Distance (haversine formula)
    a = sin(dlat / 2)**2 + cos(pickup_lat) * cos(dropoff_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    
    return distance

train['trip_distance'] = train.apply(calculateDistance, axis = 1)
train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	...	pickup_day_of_week	dropoff_date	dropoff_day	dropoff_hour	dropoff_day_of_week	pickup_latitude_round3	pickup_longitude_round3	dropoff_latitude_round3	dropoff_longitude_round3	trip_distance
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	...	Monday	2016-03-14	14	17	Monday	40.768	-73.982	40.766	-73.965	1.498991
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	...	Sunday	2016-06-12	12	0	Sunday	40.739	-73.980	40.731	-73.999	1.806074
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	...	Tuesday	2016-01-19	19	12	Tuesday	40.764	-73.979	40.710	-74.005	6.387103
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	...	Wednesday	2016-04-06	6	19	Wednesday	40.720	-74.010	40.707	-74.012	1.485965
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	...	Saturday	2016-03-26	26	13	Saturday	40.793	-73.973	40.783	-73.973	1.188962

5 rows × 24 columns
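Row-wise `apply` can be slow on about 1.46 million rows. As a hedged alternative, the same haversine formula can be written with NumPy so that it works on whole columns at once. The function name `haversine_km` is an assumption made for this sketch; the coordinates are the first two trips shown in `train.head()`, and the radius matches the 6373.0 km constant used above.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Vectorized haversine distance in km; inputs in degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # 2 * arcsin(sqrt(a)) is equivalent to 2 * atan2(sqrt(a), sqrt(1 - a))
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# First two trips from train.head()
pickup_lat = np.array([40.767937, 40.738564])
pickup_lon = np.array([-73.982155, -73.980415])
dropoff_lat = np.array([40.765602, 40.731152])
dropoff_lon = np.array([-73.964630, -73.999481])

distances = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

With the notebook's frame, the columns could be passed directly, for example `haversine_km(train['pickup_latitude'], train['pickup_longitude'], train['dropoff_latitude'], train['dropoff_longitude'])`.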

train['trip_duration_in_hour'] = train['trip_duration'].apply(lambda x:x/3600)
train.head()

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	...	dropoff_date	dropoff_day	dropoff_hour	dropoff_day_of_week	pickup_latitude_round3	pickup_longitude_round3	dropoff_latitude_round3	dropoff_longitude_round3	trip_distance	trip_duration_in_hour
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155	40.767937	-73.964630	40.765602	N	...	2016-03-14	14	17	Monday	40.768	-73.982	40.766	-73.965	1.498991	0.126389
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415	40.738564	-73.999481	40.731152	N	...	2016-06-12	12	0	Sunday	40.739	-73.980	40.731	-73.999	1.806074	0.184167
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027	40.763939	-74.005333	40.710087	N	...	2016-01-19	19	12	Tuesday	40.764	-73.979	40.710	-74.005	6.387103	0.590000
3	id3504673	2	2016-04-06 19:32:31	2016-04-06 19:39:40	1	-74.010040	40.719971	-74.012268	40.706718	N	...	2016-04-06	6	19	Wednesday	40.720	-74.010	40.707	-74.012	1.485965	0.119167
4	id2181028	2	2016-03-26 13:30:55	2016-03-26 13:38:10	1	-73.973053	40.793209	-73.972923	40.782520	N	...	2016-03-26	26	13	Saturday	40.793	-73.973	40.783	-73.973	1.188962	0.120833

5 rows × 25 columns

3. EDA(Exploratory Data Analysis)

3-1. Checking the data distribution

### Distribution of trip duration

plt.figure(figsize = (8,5))
sns.distplot(train['trip_duration_in_hour']).set_title("Distribution of Trip Duration")
plt.xlabel("Trip Duration (in hour)")

Text(0.5, 0, 'Trip Duration (in hour)')

Some records have a trip duration of more than 24 hours.

outlier_trip_duration = train.loc[train['trip_duration_in_hour'] > 24]
outlier_trip_duration

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	...	dropoff_date	dropoff_day	dropoff_hour	dropoff_day_of_week	pickup_latitude_round3	pickup_longitude_round3	dropoff_latitude_round3	dropoff_longitude_round3	trip_distance	trip_duration_in_hour
355003	id1864733	1	2016-01-05 00:19:42	2016-01-27 11:08:38	1	-73.789650	40.643559	-73.956810	40.773087	N	...	2016-01-27	27	11	Wednesday	40.644	-73.790	40.773	-73.957	20.154989	538.815556
680594	id0369307	1	2016-02-13 22:38:00	2016-03-08 15:57:38	2	-73.921677	40.735252	-73.984749	40.759979	N	...	2016-03-08	8	15	Tuesday	40.735	-73.922	40.760	-73.985	5.984365	569.327222
924150	id1325766	1	2016-01-05 06:14:15	2016-01-31 01:01:07	1	-73.983788	40.742325	-73.985489	40.727676	N	...	2016-01-31	31	1	Sunday	40.742	-73.984	40.728	-73.985	1.635641	618.781111
978383	id0053347	1	2016-02-13 22:46:52	2016-03-25 18:18:14	1	-73.783905	40.648632	-73.978271	40.750202	N	...	2016-03-25	25	18	Friday	40.649	-73.784	40.750	-73.978	19.906909	979.522778

4 rows × 25 columns

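There are 4 records with extremely long trip durations (about 539 to 980 hours).
- Their trip distances, however, are short (about 1.6 to 20 km).
- These records can be regarded as outliers.
The trip duration distribution is also skewed.

=> Apply a log transform.

### Distribution after the log transform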

plt.figure(figsize = (8,5))
sns.distplot(np.log(train['trip_duration'].values)).set_title("Distribution of Trip Duration")
plt.title("Distribution of trip duration (sec) in Log Scale")

Text(0.5, 1.0, 'Distribution of trip duration (sec) in Log Scale')

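After the log transform, the trip duration distribution is approximately normal.
Most trips fall between 54 seconds (log value 4) and 2980 seconds (log value 8).
- This suggests that most trips last less than an hour.
However, there are also trips shorter than a minute and trips that last 100 hours.
- These are likely outliers.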
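The seconds quoted for the log values 4 and 8 come from inverting the natural log with `np.exp`. A minimal sketch of that conversion (the variable names are chosen for this example):

```python
import numpy as np

log_bounds = np.array([4.0, 8.0])   # range read off the log-scale plot
seconds = np.exp(log_bounds)        # back to seconds
minutes = seconds / 60

print(seconds)   # roughly 54.6 and 2981
print(minutes)   # both below 60 minutes
```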

3-2. Location

a) Heatmaps of the locations where pickups and dropoffs commonly occur

pickup = train.groupby(['pickup_latitude_round3','pickup_longitude_round3'])['id'].count().reset_index().rename(columns = {'id':'Num_Trips'})

pickup.head()

	pickup_latitude_round3	pickup_longitude_round3	Num_Trips
0	34.360	-65.848	1
1	34.712	-75.354	1
2	35.082	-71.800	1
3	35.310	-72.074	1
4	36.029	-77.441	1
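Rounding to 3 decimal places works as a simple spatial binning: 0.001 degrees of latitude is roughly 111 m, so nearby points collapse into the same cell before counting. The sketch below illustrates this on three made-up pickups (the `points` frame and the `lat_r3`/`lon_r3` column names are assumptions for the example).

```python
import pandas as pd

# Hypothetical pickups; the first two fall into the same rounded cell
points = pd.DataFrame({
    "id": ["a", "b", "c"],
    "pickup_latitude": [40.767937, 40.768102, 40.738564],
    "pickup_longitude": [-73.982155, -73.981900, -73.980415],
})
points["lat_r3"] = points["pickup_latitude"].round(3)
points["lon_r3"] = points["pickup_longitude"].round(3)

# Count trips per rounded cell, mirroring the groupby used above
cell_counts = (
    points.groupby(["lat_r3", "lon_r3"])["id"]
    .count()
    .reset_index()
    .rename(columns={"id": "Num_Trips"})
)
print(cell_counts)
```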

### Visualize with folium (a map visualization tool)
# Pickup locations

pickup_map = folium.Map(location = [40.730610,-73.935242],
                        zoom_start = 10,)

# print(pickup.shape)

### Mark each pickup point with a circle marker
'''
for index, row in pickup.iterrows():
    
    folium.CircleMarker([row['pickup_latitude_round3'], row['pickup_longitude_round3']],
                        radius=3,
                        
                        fill_color="#3db7e4", 
                        fill_opacity=0.9
                       ).add_to(pickup_map)
    count = count + 1


'''

hm_wide = HeatMap(list(zip(pickup.pickup_latitude_round3.values, pickup.pickup_longitude_round3.values, np.array(pickup.Num_Trips.values).astype('float64'))),
                     min_opacity = 0.2,
                     radius = 5, blur = 15,
                     max_zoom = 1 
                 )
pickup_map.add_child(hm_wide)

pickup_map 


city_long_border = (-74.03, -73.75) # longitude range
city_lat_border = (40.63, 40.85) # latitude range
fig, ax = plt.subplots(ncols = 1, sharex = True, sharey = True)
ax.scatter(train['pickup_longitude'], train['pickup_latitude'],
              color = 'blue', label = 'train', alpha = 0.1)

fig.suptitle('Lat Lng of Pickups in Train Data as Scatter Plot')

ax.set_ylabel('latitude')
ax.set_xlabel('longitude')
plt.ylim(city_lat_border)
plt.xlim(city_long_border)

(-74.03, -73.75)

The high pickup density near JFK is clearly visible.

drop = train.groupby(['dropoff_latitude_round3','dropoff_longitude_round3'])['id'].count().reset_index().rename(columns = {'id':'Num_Trips'})

### Visualize dropoff locations

drop_map = folium.Map(location = [40.730610,-73.935242],zoom_start = 10,)
#print(drop.shape)
### For each dropoff point add a circlemarker
'''
for index, row in drop.iterrows():
    
    folium.CircleMarker([row['dropoff_latitude_round3'], row['dropoff_longitude_round3']],
                        radius=3,
                        
                        color="#008000", 
                        fill_opacity=0.9
                       ).add_to(drop_map)
    count=count + 1

'''
hm_wide = HeatMap(list(zip(drop.dropoff_latitude_round3.values, drop.dropoff_longitude_round3.values, np.array(drop.Num_Trips.values).astype('float64'))),
                  min_opacity = 0.2,
                  radius = 5, blur = 15,
                  max_zoom = 1 
                 )
drop_map.add_child(hm_wide)

drop_map


Pickup and dropoff locations are similar.

b) Heatmap of average trip duration by pickup point

pickup = train.groupby(['pickup_latitude_round3','pickup_longitude_round3'])['trip_duration'].mean().reset_index().rename(columns = {'trip_duration':'Avg_Trip_duration'})

### Visualize with folium

pickup_map = folium.Map(location = [40.730610, -73.935242], zoom_start = 10,)


hm_wide = HeatMap(list(zip(pickup.pickup_latitude_round3.values, pickup.pickup_longitude_round3.values, 
                           pickup.Avg_Trip_duration.values)),
                     min_opacity = 0.2,
                     radius = 7, blur = 15,
                     max_zoom = 1 
                 )
pickup_map.add_child(hm_wide)

pickup_map


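Trips that start at JFK tend to have a longer average duration.
Looking more closely, the longer averages appear at pickup points beyond Manhattan.

3-3. Time of day

a) Hours with the most pickups and dropoffs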

plt.figure(figsize = (8,5))
sns.countplot(x = train['pickup_hour']).set_title("Pickup Hours Distribution")

Text(0.5, 1.0, 'Pickup Hours Distribution')

Pickups are few in the early morning.
The peak is between 6 and 8 PM.

plt.figure(figsize = (8,5))
sns.countplot(x = train['dropoff_hour']).set_title("Dropoff Hours Distribution")

Text(0.5, 1.0, 'Dropoff Hours Distribution')

The dropoff hours follow a pattern similar to the pickup hours.

b) Pickups over the whole period

plt.figure(figsize = (8,5))
plt.plot(train.groupby('pickup_date').count()[['id']], 
         'o-',label = 'train')

plt.title("Distribution of Pickups over time")

Text(0.5, 1.0, 'Distribution of Pickups over time')

A sharp drop in the number of pickups is visible in late January 2016.

c) Trip duration by hour of day

avg_duration_hour = train.groupby(['pickup_hour'])['trip_duration'].mean().reset_index().rename(columns = {'trip_duration':'avg_trip_duration'})
plt.figure(figsize = (8,5))
plt.plot(train.groupby(['pickup_hour'])['trip_duration'].mean(), 'o-')

[<matplotlib.lines.Line2D at 0x7f226bddd960>]

The average duration increases between hours 10 and 15.

d) Pickup counts by day of week

plt.figure(figsize=(8,5))
# Pass the column as x; passing the Series as data alone left the figure empty (0 Axes)
sns.countplot(x = train['pickup_day_of_week'],
              order=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 
                     'Friday', 'Saturday', 'Sunday'])

e) Average trip duration by day of week

avg_duration_day = train.groupby(['pickup_day_of_week'])['trip_duration'].mean().reset_index().rename(columns = {'trip_duration' : 'avg_trip_duration'})

plt.figure(figsize = (8,5))
sns.barplot(x = 'pickup_day_of_week', y = 'avg_trip_duration', 
            data = avg_duration_day, 
            order = ['Monday','Tuesday','Wednesday','Thursday',
                     'Friday','Saturday', 'Sunday']).set_title("Avg Trip Duration vs Pickup Days of Week")

Text(0.5, 1.0, 'Avg Trip Duration vs Pickup Days of Week')
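The plots above pass an explicit `order` list so that weekdays appear in calendar order rather than alphabetically. The same idea can be applied to a plain table of counts with `reindex`. This is a sketch on a made-up sample (the `days` Series is hypothetical):

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical sample of pickup weekdays
days = pd.Series(["Sunday", "Monday", "Friday", "Monday", "Sunday", "Sunday"])

# Count each weekday, then put the rows in calendar order;
# weekdays that never occur get a count of 0
counts = days.value_counts().reindex(day_order, fill_value=0)
print(counts)
```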

3-4. Distance, neighbourhood, speed

a) Checking the distance distribution

plt.figure(figsize = (8,5))
sns.kdeplot(np.log(train['trip_distance'].values)).set_title("Trip Distance Distribution")
plt.xlabel("Trip Distance(log)") # log-transformed distance

Text(0.5, 0, 'Trip Distance(log)')

b) Trip duration vs trip distance

plt.scatter(np.log(train['trip_distance'].values), np.log(train['trip_duration'].values),
            color = 'blue', label = 'train')
plt.title("Distribution of Trip Distance vs Trip Duration")
plt.xlabel("Trip Distance(log scale)")
plt.ylabel("Trip Duration(log scale)")

Text(0, 0.5, 'Trip Duration(log scale)')

The number of pickups is very low on Monday. From Tuesday to Friday the number of pickups keeps increasing.

3-5. Measuring trip direction (bearing) with a provided function

def calculateBearing(lat1,lng1,lat2,lng2):
    """Initial bearing in degrees (-180 to 180) from point 1 to point 2."""
    lng_delta_rad = np.radians(lng2 - lng1) # longitude difference in radians
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    
    return np.degrees(np.arctan2(y, x))

train['bearing'] = train.apply(lambda row:calculateBearing(row['pickup_latitude_round3'],
                                                           row['pickup_longitude_round3'],
                                                           row['dropoff_latitude_round3'],
                                                           row['dropoff_longitude_round3']),
                               axis = 1)
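To make the bearing values easier to read, the sketch below re-implements the same formula under the assumed name `bearing_deg` and checks it on three simple cases. With this formula, 0 means due north, 90 due east, and -90 due west.

```python
import numpy as np

def bearing_deg(lat1, lng1, lat2, lng2):
    """Initial bearing in degrees; 0 = north, 90 = east, -90 = west."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlng = lng2 - lng1
    y = np.sin(dlng) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlng)
    return np.degrees(np.arctan2(y, x))

north = bearing_deg(40.0, -74.0, 41.0, -74.0)   # same longitude, higher latitude
east = bearing_deg(0.0, 0.0, 0.0, 1.0)          # along the equator, eastward
west = bearing_deg(0.0, 0.0, 0.0, -1.0)         # along the equator, westward
print(north, east, west)
```

Because the function only uses NumPy operations, it also accepts whole columns, which avoids the row-wise `apply`.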

a) Checking the bearing distribution

sns.kdeplot(train['bearing'])

<Axes: xlabel='bearing', ylabel='Density'>

b) Bearing vs trip duration

plt.figure(figsize = (8,5))
plt.scatter(train['bearing'].values,
            y = np.log(train['trip_duration'].values))
plt.xlabel("Bearing")
plt.ylabel("Trip Duration(log scale)")

Text(0, 0.5, 'Trip Duration(log scale)')

The trip-duration outliers all lie around bearing = -50.

3-6. Trip records

a) Checking the Store and FWD Flag distribution

train['store_and_fwd_flag'].value_counts()

N    1450599
Y    8045   
Name: store_and_fwd_flag, dtype: int64

plt.figure(figsize = (8,5))
sns.kdeplot(np.log(train.loc[train['store_and_fwd_flag'] == 'Y','trip_duration'].values),
            label = 'Store and Fwd = Yes')
sns.kdeplot(np.log(train.loc[train['store_and_fwd_flag'] == 'N','trip_duration'].values),
            label = 'Store and Fwd = No')
   
plt.title("Distribution of Store and Fwd Flag vs Trip Duration(log scale)")
plt.xlabel('Trip Duration(log scale)')
plt.ylabel('Density')

Text(0, 0.5, 'Density')

3-7. Clustering into neighbourhoods

Clustering the coordinates helps define neighbourhoods.
We run k-means clustering.

### Set up the coordinates

coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values,
                    test[['pickup_latitude', 'pickup_longitude']].values,
                    test[['dropoff_latitude', 'dropoff_longitude']].values))

### k-means clustering

kmeans = KMeans(n_clusters = 8, random_state = 0).fit(coords)
# Predict on .values so the input matches the plain array used in fit (no feature names)
train.loc[:, 'pickup_neighbourhood'] = kmeans.predict(train[['pickup_latitude', 
                                                             'pickup_longitude']].values)
train.loc[:, 'dropoff_neighbourhood'] = kmeans.predict(train[['dropoff_latitude', 
                                                              'dropoff_longitude']].values)
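The fit-then-predict pattern used here can be seen on a small synthetic example. The two point groups below are made up (they are not taken from the dataset), and `n_clusters=2` and `n_init=10` are choices for this sketch only; the notebook itself uses 8 clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical, well-separated groups of (latitude, longitude) points
group_a = rng.normal(loc=[40.75, -73.98], scale=0.005, size=(50, 2))
group_b = rng.normal(loc=[40.64, -73.78], scale=0.005, size=(50, 2))
coords_demo = np.vstack((group_a, group_b))

# Fit once on all coordinates, then assign a cluster label to each point
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords_demo)
labels_a = kmeans_demo.predict(group_a)
labels_b = kmeans_demo.predict(group_b)
print(labels_a[:5], labels_b[:5])
```

The label numbers themselves are arbitrary; only the grouping matters, which is why the cluster id is best treated as a categorical feature.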

### Set the longitude and latitude ranges

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

### Visualization

fig, ax = plt.subplots(ncols = 1, sharex = True, sharey = True)
ax.scatter(train['pickup_longitude'], train['pickup_latitude'],
           c = train['pickup_neighbourhood'], label = 'train', alpha = 0.1)

fig.suptitle('Pickup Neighbourhood')

ax.set_ylabel('latitude')
ax.set_xlabel('longitude')

plt.ylim(city_lat_border)
plt.xlim(city_long_border)

(-74.03, -73.75)

a) Number of pickups in each neighbourhood

plt.figure(figsize = (8,5))

# countplot did not render properly here,
# so histplot is used instead
sns.histplot(train['pickup_neighbourhood']).set_title("Distribution of Number of Pickups across Neighbourhoods")

Text(0.5, 1.0, 'Distribution of Number of Pickups across Neighbourhoods')

Pickup counts are highest in neighbourhoods 0, 3, and 6, in that order.

avg_duration_neighbourhood = train.groupby(['pickup_neighbourhood'])['trip_duration'].mean().reset_index().rename(columns = {'trip_duration':'avg_trip_duration'})
plt.figure(figsize = (8,5))
sns.barplot(x = 'pickup_neighbourhood',y = 'avg_trip_duration',
            data = avg_duration_neighbourhood).set_title("Avg Trip Duration vs Neighbourhood")

Text(0.5, 1.0, 'Avg Trip Duration vs Neighbourhood')

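Average trip duration is longest in neighbourhoods 2 and 3, in that order.
Neighbourhoods 1, 6, and 7 have pickup counts close to 0 in the plot above, yet their average trip durations are on the high side.

3-8. Speed

a) Distribution of average speed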

train['avg_speed_kph'] = train['trip_distance'] / train['trip_duration_in_hour']

plt.figure(figsize = (8,5))

sns.kdeplot(train['avg_speed_kph'].values).set_title("Distribution of Average Speed (in kph)")

Text(0.5, 1.0, 'Distribution of Average Speed (in kph)')

print("Average speed is",np.mean(train['avg_speed_kph']),"kph") 

# The average speed is about 14.4277 kph

Average speed is 14.427736738459107 kph
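Average speed is simply distance in km divided by duration in hours. The sketch below repeats the calculation for the first two trips shown earlier and adds one made-up zero-duration row to show a guard against division by zero. Whether such rows exist in the real data is not checked here; the guard is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# First two trips from the tables above, plus a hypothetical zero-duration row
trips = pd.DataFrame({
    "trip_distance": [1.498991, 1.806074, 0.5],   # km
    "trip_duration": [455, 663, 0],               # seconds
})

hours = trips["trip_duration"] / 3600
# Replace zero durations with NaN so the division does not produce inf
trips["avg_speed_kph"] = trips["trip_distance"] / hours.replace(0, np.nan)
print(trips)
```

Since the mean is sensitive to extreme values, the median of `avg_speed_kph` may be a useful companion to the average reported above.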

b) Average speed by day of the week

This reflects traffic speed.

avg_speed_per_day = train.groupby(['pickup_day_of_week'])['avg_speed_kph'].mean().reset_index()

plt.figure(figsize = (8,5))
sns.barplot(x = 'pickup_day_of_week', y = 'avg_speed_kph',
            data = avg_speed_per_day, 
            order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']).set_title("Avg Speed (kph) vs Pickup Days of Week")

Text(0.5, 1.0, 'Avg Speed (kph) vs Pickup Days of Week')

Average speed tends to be higher on Sunday and Monday.

4. Modeling

4-1. Test data: Feature Engineering

Apply the same feature engineering to the test data so that the model can be applied to it.

test['pickup_datetime'] = pd.to_datetime(test['pickup_datetime'],format = '%Y-%m-%d %H:%M:%S')
# test['dropoff_datetime'] = pd.to_datetime(test['dropoff_datetime'], format = '%Y-%m-%d %H:%M:%S')

test['pickup_date'] = test['pickup_datetime'].dt.date
test['pickup_day'] = test['pickup_datetime'].dt.day
test['pickup_hour'] = test['pickup_datetime'].dt.hour
test['pickup_day_of_week'] = test['pickup_datetime'].dt.day_name()
# test['dropoff_date'] = test['dropoff_datetime'].dt.date
# test['dropoff_day'] = test['dropoff_datetime'].apply(lambda x:x.day)
# test['dropoff_hour'] = test['dropoff_datetime'].apply(lambda x:x.hour)
# test['dropoff_day_of_week'] = test['dropoff_datetime'].apply(lambda x:calendar.day_name[x.weekday()])

test['pickup_latitude_round3'] = test['pickup_latitude'].round(3)
test['pickup_longitude_round3'] = test['pickup_longitude'].round(3)
test['dropoff_latitude_round3'] = test['dropoff_latitude'].round(3)
test['dropoff_longitude_round3'] = test['dropoff_longitude'].round(3)

test['trip_distance'] = test.apply(calculateDistance, axis = 1)
# test['trip_duration_in_hour'] = test['trip_duration'].apply(lambda x:x/3600)

test['bearing'] = test.apply(lambda row:calculateBearing(row['pickup_latitude_round3'],
                                                         row['pickup_longitude_round3'],
                                                         row['dropoff_latitude_round3'],
                                                         row['dropoff_longitude_round3']),
                             axis = 1)

# Predict on .values so the input matches the plain array used when fitting kmeans
test.loc[:, 'pickup_neighbourhood'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']].values)
test.loc[:, 'dropoff_neighbourhood'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']].values)

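4-2. Building the model

The "dropoff datetime" features must be dropped from the train data, since the test data has no dropoff time.
The raw lat/lng columns are dropped in favour of the versions rounded to 3 decimal places.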

drop_cols = ['avg_speed_kph','trip_duration_in_hour',
             'dropoff_date','dropoff_day','dropoff_hour','dropoff_day_of_week','dropoff_datetime',
             'pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']

training = train.drop(drop_cols, axis = 1)
testing = test.drop(['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'],axis = 1)

We need to predict trip_duration.
- We predict it on a log scale.

### Log transform

training['log_trip_duration'] = np.log(training['trip_duration'])
training.drop(['trip_duration'], axis = 1, inplace = True)

print("Training Data Shape ", training.shape)
print("Testing Data Shape ", testing.shape)

Training Data Shape  (1458644, 18)
Testing Data Shape  (625134, 17)

Encode the day of week as a number

def encodeDays(day_of_week):
    day_dict = {'Sunday':0, 'Monday':1, 'Tuesday':2, 'Wednesday':3,
                'Thursday':4, 'Friday':5, 'Saturday':6}
                
    return day_dict[day_of_week]

training['pickup_day_of_week'] = training['pickup_day_of_week'].apply(lambda x:encodeDays(x))
testing['pickup_day_of_week'] = testing['pickup_day_of_week'].apply(lambda x:encodeDays(x))
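The same encoding can be expressed with `Series.map` and the dictionary directly, without a helper function. A minimal sketch on a made-up Series (the `days` values are illustrative):

```python
import pandas as pd

day_dict = {'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
            'Thursday': 4, 'Friday': 5, 'Saturday': 6}

# Hypothetical weekday labels
days = pd.Series(['Monday', 'Sunday', 'Saturday'])

# map looks each label up in the dictionary;
# a label missing from the dictionary would become NaN instead of raising
encoded = days.map(day_dict)
print(encoded.tolist())
```

One design difference to keep in mind: `encodeDays` raises a `KeyError` on an unexpected label, while `map` silently yields NaN, so the stricter function can be preferable when bad labels should stop the run.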

### Save the final data
# Save the processed data to the final files

training.to_csv("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/9주차/data/input_training.csv",index = False)
testing.to_csv("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/9주차/data/input_testing.csv",index = False)

del training
del testing
del train
del test

### Define helper functions

# 1) Label encoding
def LabelEncoding(train_df,test_df,max_levels = 2):
  for col in train_df:
    if train_df[col].dtype == 'object':
      if len(list(train_df[col].unique())) <= max_levels:
        le = preprocessing.LabelEncoder()
        le.fit(train_df[col])
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
      
  return [train_df,test_df]
                

def readInputAndEncode(input_path,train_file,test_file,target_column):
    training = pd.read_csv(input_path + train_file)
    testing = pd.read_csv(input_path + test_file)
   
    training,testing = LabelEncoding(training,testing)
    
    # print("Training Data Shape after Encoding ",training.shape)
    # print("Testing Data Shape after Encoding ",testing.shape)

    ### Check that every train column exists in the test data
    # If not, add the column to the test data and fill it with 0

    train_cols = training.columns.tolist()
    test_cols = testing.columns.tolist()
    
    col_in_train_not_test = set(train_cols) - set(test_cols)
    for col in col_in_train_not_test:
      if col != target_column:
        testing[col] = 0
    
    col_in_test_not_train = set(test_cols) - set(train_cols)
    for col in col_in_test_not_train:
      training[col] = 0
    
    print("Training Data Shape after Processing ",training.shape)
    print("Testing Data Shape after Processing ",testing.shape)
    
    return [training,testing]

train,test = readInputAndEncode("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/9주차/data/",
                                'input_training.csv','input_testing.csv','log_trip_duration')
train.drop(['pickup_date'], axis = 1, inplace = True)
test.drop(['pickup_date'], axis = 1, inplace = True)

train.drop(['pickup_datetime'], axis = 1, inplace = True)
test.drop(['pickup_datetime'], axis = 1, inplace = True)

test_id = test['id']
train.drop(['id'], axis = 1,inplace = True)
test.drop(['id'], axis = 1, inplace = True)

Training Data Shape after Processing  (1458644, 18)
Testing Data Shape after Processing  (625134, 17)

4-3. 모델 적용하기

def GetFeaturesAndSplit(train, test, target,
                        imputing_strategy = 'median', split = 0.25, imputation = True):
    labels = np.array(train[target])
    training = train.drop(target, axis = 1)
    training = np.array(training)
    testing = np.array(test)
    
    if imputation == True:
        imputer = SimpleImputer(strategy = imputing_strategy, missing_values = np.nan)
        imputer.fit(training)
        
        training = imputer.transform(training)
        testing = imputer.transform(testing)
    
    train_features, validation_features, train_labels, validation_labels = train_test_split(training, labels, 
                                                                                            test_size = split, 
                                                                                            random_state = 42)
    
    return [train_features,validation_features,train_labels,validation_labels,testing]

train_features, validation_features, train_labels, validation_labels, testing = GetFeaturesAndSplit(train, test, 
                                                                                                    'log_trip_duration', imputation = False)

a) Linear Regression

### Training

lm = linear_model.LinearRegression()
lm.fit(train_features, train_labels)

LinearRegression()


### Prediction

valid_pred = lm.predict(validation_features)

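### Evaluation

# mean_squared_error returns the MSE; its square root is the RMSE
mse = mean_squared_error(validation_labels, valid_pred)
print("Mean Squared Error for Linear Regression(log scale): ",mse)

Mean Squared Error for Linear Regression(log scale):  0.4031176249688163

The corresponding RMSE is the square root of this value, about 0.635, which is in line with the 0.64221 leaderboard score noted in the submission code.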
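`mean_squared_error` returns the mean squared error, so the 0.4031 printed for the linear model is on the squared scale and the RMSE is its square root. The sketch below shows the relationship with NumPy only. The helper name `rmse` and the small label/prediction lists are assumptions for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical log-scale labels and predictions
y_true = [6.0, 6.5, 7.0]
y_pred = [6.0, 6.0, 8.0]
demo_rmse = rmse(y_true, y_pred)
print(demo_rmse)

# Converting the validation value printed for linear regression into an RMSE
reported_mse = 0.4031176249688163
reported_rmse = float(np.sqrt(reported_mse))
print(reported_rmse)
```

Because the target is the log of the duration, an RMSE computed this way is an error on the log scale, not in seconds.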

### Create the submission file

test_pred = lm.predict(testing)
submit = pd.DataFrame()
submit['id'] = test_id
submit['trip_duration'] = np.exp(test_pred)
submit.to_csv("/content/drive/MyDrive/Colab Notebooks/ECC 48기 데과B/9주차/data/submission_linear_regression_baseline.csv",index=False) #0.64221 on Leader board

del submit

b) Random Forest Regressor

rf = RandomForestRegressor(n_estimators = 100, random_state = 42)

### Training

rf.fit(train_features, train_labels)

RandomForestRegressor(random_state=42)

### Prediction

valid_pred_rf = rf.predict(validation_features)
# mean_squared_error returns the MSE; its square root is the RMSE
mse = mean_squared_error(validation_labels, valid_pred_rf)
print("Mean Squared Error for Random Forest", mse)

Mean Squared Error for Random Forest 0.16585976592912732

The corresponding RMSE is about 0.407.

test_pred = rf.predict(testing)
submit = pd.DataFrame()
submit['id'] = test_id
submit['trip_duration'] = np.exp(test_pred)
submit.to_csv("submission_random_forest_baseline.csv",index = False)
