
Optimized Algorithm for appending ghost rows to an existing dataframe in Python


Question

I have a dataframe, and I want to append ghost rows (copies of an existing row) to it.

       id   month  as_of_date1 turn  age 
119 5712    201401  2014-01-01  9   0
120 5712    201402  2014-02-01  9   1
121 5712    201403  2014-03-01  9   2
122 5712    201404  2014-04-01  9   3
123 5712    201405  2014-05-01  9   4
124 5712    201406  2014-06-01  9   5
125 9130    201401  2014-01-01  9   0
126 9130    201402  2014-02-01  9   1
127 9130    201403  2014-03-01  9   2
128 9130    201404  2014-04-01  9   3
129 9130    201405  2014-05-01  9   4

The ghost rows are selected by this condition: if age is less than turn, we need to append the latest row until age == turn or until as_of_date1 == now().
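As a minimal sketch of that rule for a single row (using the fixed cutoff date 2018-06-01 as a stand-in for now(), as the code below does; note the repeat count uses the total month difference, which is what the expected outcome implies):

```python
import datetime
from dateutil import relativedelta

turn, age = 9, 5                        # the last row of id 5712
as_of = datetime.datetime(2014, 6, 1)
cutoff = datetime.datetime(2018, 6, 1)  # stand-in for now()

# months until as_of_date1 would reach the cutoff
delta = relativedelta.relativedelta(cutoff, as_of)
months_to_cutoff = delta.years * 12 + delta.months

# steps until age would catch up with turn
steps_to_turn = turn - age

repeats = min(months_to_cutoff, steps_to_turn)
print(repeats)  # 4 ghost copies, matching the expected outcome below
```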

Right now I'm using the following code, but since the data is large (around 200k rows with 100 fields) it takes forever:

import datetime
from dateutil import relativedelta

# last row per id that is still "young" (age < turn)
tdf1 = tdf.loc[tdf['age'] < tdf['turn']]
tdf2 = tdf1.drop_duplicates(subset=['id'], keep='last')
leads = tdf2.index.tolist()

for lead in leads:
    ttdf = tdf.loc[[lead]]
    # total months between as_of_date1 and the cutoff date
    # (.months alone only gives the month component, so include the years)
    delta = relativedelta.relativedelta(datetime.datetime(2018, 6, 1),
                                        tdf.loc[lead, 'as_of_date1'])
    diff1 = delta.years * 12 + delta.months
    diff2 = tdf.loc[lead, 'turn'] - tdf.loc[lead, 'age']
    diff = min(diff1, diff2)
    for i in range(diff):
        tdf = tdf.append(ttdf, ignore_index=True)  # grows tdf one copy at a time

Expected outcome:

    id   month  as_of_date1 turn  age 
119 5712    201401  2014-01-01  9   0
120 5712    201402  2014-02-01  9   1
121 5712    201403  2014-03-01  9   2
122 5712    201404  2014-04-01  9   3
123 5712    201405  2014-05-01  9   4
124 5712    201406  2014-06-01  9   5
125 9130    201401  2014-01-01  9   0
126 9130    201402  2014-02-01  9   1
127 9130    201403  2014-03-01  9   2
128 9130    201404  2014-04-01  9   3
129 9130    201405  2014-05-01  9   4
130 5712    201406  2014-06-01  9   5
131 5712    201406  2014-06-01  9   5
132 5712    201406  2014-06-01  9   5
133 5712    201406  2014-06-01  9   5
134 9130    201405  2014-05-01  9   4
135 9130    201405  2014-05-01  9   4
136 9130    201405  2014-05-01  9   4
137 9130    201405  2014-05-01  9   4
138 9130    201405  2014-05-01  9   4

I would appreciate it if anyone knows a faster algorithm.

Solution

As @Parfit mentioned in the comments, appending to a dataframe is really memory-intensive, and doing it inside a loop is not advised at all. So I used the following, which increased the speed incredibly:

import datetime
from dateutil import relativedelta

a = []
for lead in leads:
    ttdf = tdf.loc[[lead]]
    # total months between as_of_date1 and the cutoff date
    delta = relativedelta.relativedelta(datetime.datetime(2018, 6, 1),
                                        tdf.loc[lead, 'as_of_date1'])
    diff1 = delta.years * 12 + delta.months
    diff2 = tdf.loc[lead, 'turn'] - tdf.loc[lead, 'age']
    diff = min(diff1, diff2)
    for i in range(diff):
        a.append(ttdf)  # collect the copies in a plain list instead of growing tdf

# single append of all ghost rows at once
# (DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent)
tdf = tdf.append(a, ignore_index=True)
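If the per-lead Python loop is still a bottleneck, the whole computation can also be vectorized. The sketch below (an assumption-laden illustration, not the answer's code: it uses a small stand-in frame, the same 2018-06-01 cutoff, and pandas/numpy column operations) computes every lead's repeat count at once and duplicates the ghost rows with Index.repeat, appending in a single pd.concat:

```python
import datetime
import numpy as np
import pandas as pd

# small stand-in frame; the real tdf has ~200k rows and 100 fields
tdf = pd.DataFrame({
    'id': [5712, 5712, 9130, 9130],
    'month': [201405, 201406, 201404, 201405],
    'as_of_date1': pd.to_datetime(['2014-05-01', '2014-06-01',
                                   '2014-04-01', '2014-05-01']),
    'turn': [9, 9, 9, 9],
    'age': [4, 5, 3, 4],
})
cutoff = datetime.datetime(2018, 6, 1)

# last row per id that still has age < turn (the ghost-row template)
last = tdf[tdf['age'] < tdf['turn']].drop_duplicates(subset=['id'], keep='last')

# repeat counts, computed column-wise instead of row by row:
# total months to the cutoff vs. steps until age reaches turn
diff1 = ((cutoff.year - last['as_of_date1'].dt.year) * 12
         + (cutoff.month - last['as_of_date1'].dt.month))
diff2 = last['turn'] - last['age']
reps = np.minimum(diff1, diff2).clip(lower=0)

# duplicate each template row `reps` times and append in a single concat
ghosts = last.loc[last.index.repeat(reps)]
tdf = pd.concat([tdf, ghosts], ignore_index=True)
```

On 200k rows this replaces the Python-level loop with a handful of column operations, which is typically far faster than even the list-then-append version.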