Speeding up DataFrame data operations

Posted by 我们棒棒哒 on June 13 at 15:51

First, import the libraries we need and initialize a DataFrame for testing.

import pandas as pd
import time

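def timmer(func):
    # Decorator that prints how long one call to func takes.
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)  # forward the arguments to func
        stop_time = time.time()
        print("the func run time is %s" % (stop_time - start_time))
        return result
    return wrapper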

def add(num):
    return num+2

df = pd.DataFrame(
    columns=['a','b','c','d','e','f'],
    index=['date','value'],
    data=[['1/1/13 0:00','1/1/13 1:00','1/1/13 2:00','1/1/13 3:00','1/1/13 4:00','1/1/13 5:00'],[0.1,0.2,3,4,5,0.6]]
).T
df
          date value
a  1/1/13 0:00   0.1
b  1/1/13 1:00   0.2
c  1/1/13 2:00     3
d  1/1/13 3:00     4
e  1/1/13 4:00     5
f  1/1/13 5:00   0.6
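One thing worth checking before timing anything: because the frame is built row-wise and then transposed, both columns should come out as `object` dtype rather than string/float. The sketch below rebuilds the same frame and prints the dtypes; the `typed` copy is my own addition for comparison and is not used in the tests that follow.

```python
import pandas as pd

# Same construction as in the post: two rows of mixed data, then transpose.
df = pd.DataFrame(
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
    index=['date', 'value'],
    data=[['1/1/13 0:00', '1/1/13 1:00', '1/1/13 2:00',
           '1/1/13 3:00', '1/1/13 4:00', '1/1/13 5:00'],
          [0.1, 0.2, 3, 4, 5, 0.6]]
).T

print(df.dtypes)  # inspect what the transpose produced

# Hypothetical typed copy: convert each column to a proper dtype.
typed = df.copy()
typed['value'] = typed['value'].astype(float)
typed['date'] = pd.to_datetime(typed['date'], format='%d/%m/%y %H:%M')
print(typed.dtypes)
```

Arithmetic on an `object` column goes through Python objects one by one, so it is worth keeping in mind when reading the timings later in the post.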

Check which pandas version we are using:

pd.__version__
'0.23.4'

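First I timed converting strings to datetime. As the article says, passing format really does help and the speedup is clear (about 3.33 s versus 2.26 s here), but it is nowhere near as large as the article describes. I suspect that comes down to a difference in pandas versions, so it is still worth verifying things yourself.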

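@timmer
def test_1():
    for _ in range(1000):
        # reset the column to strings so that every pass parses from scratch
        df['date'] = df['date'].apply(lambda x: '1/1/13 1:00')
        df['date'] = pd.to_datetime(df['date'])  # no format given
test_1()

@timmer
def test_2():
    for _ in range(1000):
        # reset the column to strings so that every pass parses from scratch
        df['date'] = df['date'].apply(lambda x: '1/1/13 1:00')
        df['date'] = pd.to_datetime(df['date'], format='%d/%m/%y %H:%M')  # explicit format
test_2()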
the func run time is 3.329446315765381
the func run time is 2.2600209712982178
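To repeat this comparison on your own pandas version, here is a minimal sketch using the standard-library `timeit` module on a longer column. The `dates` Series and its size are made up for illustration; the gap you see may well differ from mine, since newer pandas releases handle the no-format case differently.

```python
import timeit
import pandas as pd

# Hypothetical test data: 12,000 strings shaped like the ones in the post.
dates = pd.Series(['1/1/13 %d:00' % h for h in range(24)] * 500)

fmt = '%d/%m/%y %H:%M'

# Parse once each way so the results can be compared.
parsed_plain = pd.to_datetime(dates)
parsed_fmt = pd.to_datetime(dates, format=fmt)

# Time a few repetitions of each variant.
t_plain = timeit.timeit(lambda: pd.to_datetime(dates), number=5)
t_fmt = timeit.timeit(lambda: pd.to_datetime(dates, format=fmt), number=5)

print('without format: %.4f s' % t_plain)
print('with format:    %.4f s' % t_fmt)
```

Day and month are both 1 in this data, so the day-first format string and the default parsing agree on the result.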

Next, test the speed of modifying data. Here we use six different methods:

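The results show that vectorization (test_8) and .apply (test_6, test_7) come out on top at roughly 0.53 to 0.56 s and are the best choices. itertuples and iterrows form the second tier at roughly 0.92 to 1.10 s, and loc is the slowest by a wide margin at about 5.79 s.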

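@timmer
def test_3():
    # .loc: label-based read and write, one cell at a time
    for _ in range(1000):
        for idx in df.index:
            df.loc[idx, 'value'] = df.loc[idx, 'value'] + 0.002
@timmer
def test_4():
    # iterrows: row may be a copy, so this write is not guaranteed to reach df
    for _ in range(1000):
        for index, row in df.iterrows():
            row['value'] = row['value'] + 0.002
@timmer
def test_5():
    # itertuples: only rebinds the local name value; df is not modified
    for _ in range(1000):
        for index, date, value in df.itertuples():
            value = value + 0.002
@timmer
def test_6():
    # .apply with a named function (note: add() adds 2, not 0.002)
    for _ in range(1000):
        df['value'] = df['value'].apply(add)
@timmer
def test_7():
    # .apply with a lambda (also adds 2)
    for _ in range(1000):
        df['value'] = df['value'].apply(lambda x: x + 2)
@timmer
def test_8():
    # vectorized in-place addition on the whole column
    for _ in range(1000):
        df['value'] += 0.002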
        
test_3()
test_4()
test_5()
test_6()
test_7()
test_8()
the func run time is 5.793862819671631
the func run time is 1.096304178237915
the func run time is 0.9203734397888184
the func run time is 0.5614016056060791
the func run time is 0.5276052951812744
the func run time is 0.534296989440918
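One caveat about the loop-based tests: the itertuples version only rebinds a local variable, so it measures iteration cost without actually changing the frame. A small sketch of the difference, using a made-up single-column frame (not the `df` from the post) so the unpacking stays simple:

```python
import pandas as pd

# Hypothetical one-column frame with a float64 column.
frame = pd.DataFrame({'value': [0.1, 0.2, 3.0, 4.0, 5.0, 0.6]},
                     index=list('abcdef'))
before = frame['value'].copy()

# Looping over itertuples and reassigning the loop variable
# changes nothing in the frame itself.
for index, value in frame.itertuples():
    value = value + 0.002
unchanged = frame['value'].equals(before)
print('frame unchanged after loop:', unchanged)

# To really write back from a loop, collect the new values and assign once.
frame['value'] = [value + 0.002 for _, value in frame.itertuples()]

# The vectorized form does the same update in one step.
frame['value'] += 0.002
print(frame)
```

So if the goal is to modify the data rather than just read it, the loop variants need an extra write-back step, which would make them slower still.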

Along the way we not only got to know several common ways of operating on data, but also found the better-performing options.
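With only six rows, fixed per-call overhead probably dominates the timings, which may be why vectorization and .apply come out nearly tied in this test. A rough sketch for rerunning the comparison on a larger column, assuming a hypothetical 100,000-row `float64` Series (my choice of size, not something from the original test):

```python
import timeit
import pandas as pd

# Hypothetical larger column, already float64.
values = pd.Series(range(100000), dtype='float64')

# Both variants compute the same result.
via_apply = values.apply(lambda v: v + 0.002)
via_vector = values + 0.002

# Time a few repetitions of each.
t_apply = timeit.timeit(lambda: values.apply(lambda v: v + 0.002), number=10)
t_vector = timeit.timeit(lambda: values + 0.002, number=10)

print('apply:      %.4f s' % t_apply)
print('vectorized: %.4f s' % t_vector)
```

Run it on your own machine and pandas version before drawing conclusions; the exact ratio will vary.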
