How to process large amounts of data in the GAE datastore (more than 1000 records)
The painful parts of GAE are:
1. Each request is limited to 30 seconds.
2. Each datastore query can fetch at most 1000 records.
These limits are a real headache when you need to run a data format conversion.
For now there is only one way to break through these limits:
1. Enable the remote API (a connection sketch follows the code below).
2. Run something like the following:
import logging

# DirectHandler is this app's request-handler base class; Order is its db.Model.
class batch_MarkAllOrderItemAsUnhandledAndFormatOld(DirectHandler):
    def get(self):
        logging.info('batch_MarkAllOrderItemAsUnhandledAndFormatOld start')
        count = 0
        query = Order.all()
        query.order('__key__')
        # count stays a multiple of 1000 as long as every batch is full;
        # the first short batch (current_count < 1000) ends the loop
        while count % 1000 == 0:
            current_count = query.count()
            # if no more items can be fetched
            if current_count == 0:
                break
            items = query.fetch(1000, 0)
            for item in items:
                item.format = "old"
                item.batch_processed = False
                item.put()
            count += current_count
            if current_count == 1000:
                # key of the last item of the current batch
                # (equivalent to query.fetch(1, 999)[0].key(), minus one extra RPC)
                last_key = items[-1].key()
                query = query.filter('__key__ >', last_key)
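For step 1, here is a minimal connection sketch, assuming the Python 2 GAE SDK with the remote_api builtin turned on in app.yaml; 'your-app-id' is a placeholder for the real application id:

    # app.yaml needs the endpoint enabled:
    #   builtins:
    #   - remote_api: on

    import getpass

    from google.appengine.ext.remote_api import remote_api_stub

    def auth_func():
        # prompt for the credentials of an application admin
        return (raw_input('Email: '), getpass.getpass('Password: '))

    # Route datastore calls from this local process to the live app.
    # 'your-app-id' is a placeholder.
    remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                       'your-app-id.appspot.com')

The SDK's remote_api_shell.py does the same setup and drops you into an interactive console, from which you can import your models and run a batch loop like the one above.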
Because this runs on your own machine, there is no 30-second limit. And because it runs in a console-style interface that prints a line for every record processed, it is easy to monitor the current state, and it even gives a satisfying sense of progress. This way you can work through all the data; naturally, the more records there are, the longer it takes.
2014/2/19 update with a more complete example:
def batch_process_blog_page():
    from db.models import Page
    total_counter = 0
    count = 0
    query = Page.all()
    query.order('__key__')
    while count % 1000 == 0:
        current_count = query.count()
        # if no more items can be fetched
        if current_count == 0:
            break
        pages = query.fetch(1000, 0)
        for page in pages:
            total_counter = total_counter + 1
            print 'page NO. %s processing, id: %s...' % (total_counter, page.id)
            try:
                if page.category == "Personal blog":
                    page.in_official_blog_page_pool = True
                else:
                    page.in_official_blog_page_pool = False
                page.put()
            except Exception:
                print 'something goes wrong when processing page NO. %s, id: %s' % (total_counter, page.id)
        # if current_count < 1000, the new value of count % 1000 will not be 0,
        # which ends this while loop
        count += current_count
        if current_count == 1000:
            # key of the last item of the current batch
            last_key = pages[-1].key()
            query = query.filter('__key__ >', last_key)
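As a side note, SDKs from 1.3.1 on support query cursors on db.Query, which can replace the count()/offset bookkeeping above. Here is a sketch of the same loop rewritten with with_cursor(), assuming the same Page model; this is an alternative technique, not what the examples above use:

    def batch_process_blog_page_with_cursor():
        from db.models import Page
        total_counter = 0
        cursor = None
        query = Page.all()
        while True:
            if cursor:
                # resume the query right after the previous batch
                query.with_cursor(cursor)
            pages = query.fetch(1000)
            if not pages:
                break
            for page in pages:
                total_counter += 1
                page.in_official_blog_page_pool = (page.category == "Personal blog")
                page.put()
                print 'page NO. %s processed, id: %s' % (total_counter, page.id)
            # opaque bookmark marking where this batch ended
            cursor = query.cursor()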