The Difference Between Threading and Starmap

by Taeyoon.Kim.DS 2024. 1. 23. 20:18

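    # Start one thread per image so the downloads overlap while waiting on the network.
    thread_list = []
    for hash_path in hash_path_list:
        # str.replace returns a new string, so the result must be assigned back.
        hash_path = hash_path.replace("/", "")
        thread = threading.Thread(
            target=storage.download_image_from_s3_to_temporary_directory_using_cdn,
            args=(hash_path, path),
        )
        thread.start()
        thread_list.append(thread)
    # Wait for every download to finish before continuing.
    for thread in thread_list:
        thread.join()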

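The code creates thread_list, then reads each hash_path from hash_path_list and removes every "/" from it.

Each thread is a threading.Thread whose target is the method that downloads an image through the CDN, and it receives the current hash_path and path as arguments. Note that path does not appear to be defined anywhere in this snippet. Each thread is started with start() as soon as it is created and is appended to thread_list. Finally, join() is called on every thread in thread_list so the code waits for all downloads to finish.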
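The thread-per-image pattern described above can be tried without S3 access. The following is a minimal sketch, not the project's actual code: `fake_download` is a hypothetical stand-in for storage.download_image_from_s3_to_temporary_directory_using_cdn, and the hash paths and the `/tmp/images` directory are made up for illustration.

```python
import threading
import time

results = []
results_lock = threading.Lock()


def fake_download(hash_path, path):
    """Hypothetical stand-in for the real CDN download method."""
    time.sleep(0.05)  # simulate network latency
    with results_lock:  # protect the shared list while appending
        results.append(f"{path}/{hash_path}")


def download_with_threads(hash_path_list, path):
    thread_list = []
    for hash_path in hash_path_list:
        # Reassign, because str.replace does not modify the string in place.
        hash_path = hash_path.replace("/", "")
        thread = threading.Thread(target=fake_download, args=(hash_path, path))
        thread.start()
        thread_list.append(thread)
    # Block until every thread has finished.
    for thread in thread_list:
        thread.join()


download_with_threads(["ab/c1", "de/f2", "gh/i3"], "/tmp/images")
print(sorted(results))
```

Because the threads may finish in any order, the sketch sorts the results before printing them.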

Threading with threading.Thread:

Threads are lightweight and share the same memory space, making them suitable for I/O-bound tasks where the bottleneck is waiting for external resources, such as network or disk operations.
In Python, due to the Global Interpreter Lock (GIL), threads are not well-suited for CPU-bound tasks where parallel execution is needed for performance improvement.
The GIL allows only one thread to execute Python bytecode at a time, which limits the parallelism in CPU-bound tasks.

The first code snippet uses the threading module to create multiple threads, each responsible for downloading one image from S3. The threads run concurrently, and the GIL matters little for I/O-bound work like this, because a thread releases the GIL while it waits on the network.
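To make the I/O-bound point concrete, here is a small sketch that compares running five simulated downloads one after another with running them in threads. It assumes `time.sleep` as a stand-in for waiting on the network; the function names and the durations are illustrative only.

```python
import threading
import time


def io_task(seconds):
    # A blocking wait; CPython releases the GIL while sleeping.
    time.sleep(seconds)


def run_sequential(n, seconds):
    start = time.perf_counter()
    for _ in range(n):
        io_task(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    start = time.perf_counter()
    threads = [threading.Thread(target=io_task, args=(seconds,)) for _ in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


sequential = run_sequential(5, 0.1)
threaded = run_threaded(5, 0.1)
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The sequential run has to wait for each task in turn, while the threaded run waits for all of them at once, so its total time should be close to the duration of a single task.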

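By contrast, starmap on multiprocessing.Pool is suited to CPU-bound tasks: each worker is a separate process with its own interpreter and its own GIL, so the work can run in parallel across CPU cores.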

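import multiprocessing

def download_images_using_starmap(hash_path_list):
    # Each element of hash_path_list must be an argument tuple for the target function.
    pool = multiprocessing.Pool()  # defaults to one worker process per CPU
    pool.starmap(storage.download_image_from_s3_to_temporary_directory_using_cdn, hash_path_list, chunksize=10)
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the worker processes to exit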

Here I expect the hash_path_list that gets passed in to be different. In training it would be the full hash_path_list of 250 items, while in prediction it would be the 6 to 7 image hash paths that belong to each listing. Checking the code:

The call is download_images_using_starmap(zip_hash_list_and_path_list(hash_list, path)), and zip_hash_list_and_path_list takes hash_list. This hash_list is ultimately what the parallel_download_images_from_s3 method works on, and that method is called from add_images_to_directories in image_model.py. The argument passed to parallel_download_images_from_s3 is dir_lists["train_infringing_hashs"], which is the full training list obtained by taking 75% of self.hash_lists["infringing_hashs"].
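starmap expects an iterable of argument tuples and unpacks each tuple into a call to the target function, which is presumably why the hash list is zipped with the path first. The sketch below shows one way this could look. The body of `zip_hash_list_and_path_list` is a guess at what the helper does, `fake_download` is a hypothetical stand-in for the real download method, and the hashes and `/tmp/images` are made-up values.

```python
import multiprocessing


def zip_hash_list_and_path_list(hash_list, path):
    """Hypothetical implementation: pair every hash with the same target path."""
    return [(hash_path, path) for hash_path in hash_list]


def fake_download(hash_path, path):
    """Hypothetical stand-in; returns the destination instead of downloading."""
    return f"{path}/{hash_path}"


def download_images_using_starmap(hash_path_list):
    # starmap calls fake_download(*args) for every tuple in hash_path_list.
    with multiprocessing.Pool(processes=2) as pool:
        return pool.starmap(fake_download, hash_path_list, chunksize=10)


if __name__ == "__main__":
    args = zip_hash_list_and_path_list(["abc1", "def2"], "/tmp/images")
    print(args)
    print(download_images_using_starmap(args))
```

The `if __name__ == "__main__":` guard matters on platforms that start worker processes by re-importing the main module, since without it each worker would try to create its own pool.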

By contrast, the argument to the download_images_using_threading method is image_list, and since image_list is row["images"] for a single row, it holds 7 to 8 items or fewer.

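With threading: create the threads, assign one image to each thread, then download.

With the pool: items are taken from the whole list in chunks of 10 (chunksize=10) and downloaded. So the two methods are used for different purposes, because the current code does not pull batches of 200 rows from the whole dataframe for prediction.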
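To picture what chunksize=10 means for the two list sizes mentioned above, here is a small illustration. It is not the code that multiprocessing.Pool runs internally; `chunked` is a hypothetical helper that only mimics the grouping, and the hash names are made up.

```python
from itertools import islice


def chunked(items, chunksize):
    """Yield consecutive lists of at most `chunksize` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# A training-sized list of 250 hashes is split into 25 chunks of 10.
training_hashes = [f"hash_{i}" for i in range(250)]
training_chunks = list(chunked(training_hashes, 10))
print(len(training_chunks), len(training_chunks[0]))

# A prediction-sized list of 7 hashes fits into a single chunk.
prediction_hashes = [f"hash_{i}" for i in range(7)]
prediction_chunks = list(chunked(prediction_hashes, 10))
print(len(prediction_chunks), len(prediction_chunks[0]))
```

With only a handful of images per listing, everything lands in one chunk, so a process pool has nothing to spread across workers; that fits the choice of threads for the per-listing case.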
