Assignment 3-1, 3-2: Low Complexity

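Assignment 3-1 finds the start positions of low-complexity regions in a given DNA sequence file and writes them out, and it measures and prints the time spent on each sequence. A low-complexity region is a region in which the same DNA subsequence repeats. In this code, that means a unit of 2 to 5 bases that appears at least three times in a row.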
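To make the definition concrete, here is a minimal sketch that counts how many times a unit repeats back to back at a given position. The names `tandem_repeat_count` and `starts_low_complexity` are hypothetical helpers written for this illustration, not functions from the assignment code. The unit length of 2 to 5 and the threshold of three repeats are taken from the code in this post.

```python
def tandem_repeat_count(seq, start, unit_len):
    """Count how many times seq[start:start+unit_len] repeats back to back."""
    if unit_len < 1:
        raise ValueError("unit_len must be at least 1")
    unit = seq[start:start + unit_len]
    if len(unit) < unit_len:
        return 0  # not enough characters left for one full unit
    count = 1
    pos = start + unit_len
    # Keep stepping one unit at a time while the next slice equals the unit
    while seq[pos:pos + unit_len] == unit:
        count += 1
        pos += unit_len
    return count


def starts_low_complexity(seq, start):
    """True if some unit of length 2 to 5 repeats 3 or more times from start."""
    return any(tandem_repeat_count(seq, start, n) >= 3 for n in range(2, 6))


print(tandem_repeat_count("GGATATATCC", 2, 2))  # AT, AT, AT
print(starts_low_complexity("GGATATATCC", 2))
print(starts_low_complexity("GGATATATCC", 0))
```

In the sample string, "AT" repeats three times starting at index 2, so that position counts as the start of a low-complexity region, while index 0 does not.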

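The main parts of the code are:

removeBlank function: removes spaces and newline characters from a string and converts every character to uppercase.
checkDnaData function: checks that the string is a valid DNA sequence (made up only of A, T, G, and C) and stops the program if it is not.
find_low_complexity function: finds the start positions of low-complexity regions in a DNA sequence. 3-1 finds them with a regular expression. 3-2 finds them with string indexing and loops.
File handling: reads the file contents, then processes each DNA sequence and outputs the results.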
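The two helper functions can be tried on a small input as follows. This is a sketch under assumptions: `clean_sequence` and `is_dna` are hypothetical names that mirror removeBlank and checkDnaData, and unlike checkDnaData, `is_dna` returns a boolean instead of exiting the program.

```python
import re


def clean_sequence(text):
    """Drop all whitespace and uppercase the result (same idea as removeBlank)."""
    return re.sub(r"\s+", "", text).upper()


def is_dna(sequence):
    """Return True only for a non-empty string made up of A, T, G, and C."""
    return re.fullmatch(r"[ATGC]+", sequence) is not None


raw = " acg t\nTA "
cleaned = clean_sequence(raw)
print(cleaned, is_dna(cleaned))  # ACGTTA True
print(is_dna("ACGN"))            # False, N is not one of A, T, G, C
```

Cleaning before validating matters here: the raw text contains spaces and a newline, which would fail the A/T/G/C check on their own.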

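The code runs as follows:

It takes the file path as a command-line argument.
It reads the file and splits it into a list of DNA sequences, using the header lines that start with '>' as separators.
For each sequence, it removes spaces and newline characters and checks that the result is a valid DNA sequence.
For each sequence, it finds the start positions of the low-complexity regions and measures the time taken.
It prints the time taken for each sequence and saves the start positions to a file (output3-1.txt or output3-2.txt).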
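The splitting and timing steps above can be sketched on an in-memory string. This is an illustration under assumptions, not the assignment code itself: `split_records` and `timed` are hypothetical names, the sample text is made up, and `len` stands in for the real search function.

```python
import re
import time


def split_records(text):
    """Split FASTA-like text on '>' header lines and clean each sequence."""
    parts = re.split(r">.*?\n", text)
    if parts and not parts[0]:
        parts = parts[1:]  # nothing comes before the first header
    return [re.sub(r"\s+", "", part).upper() for part in parts]


def timed(func, sequence):
    """Run func(sequence) and return (result, elapsed time in microseconds)."""
    start = time.perf_counter()
    result = func(sequence)
    elapsed_us = (time.perf_counter() - start) * 1e6
    return result, elapsed_us


sample = ">seq1\nacgt\nAC\n>seq2\nGGG\n"
records = split_records(sample)
print(records)  # ['ACGTAC', 'GGG']

for number, record in enumerate(records, start=1):
    length, elapsed_us = timed(len, record)  # len stands in for the real search
    print(f"sequence {number}: length {length}, {elapsed_us:.5f} (us)")
```

Only the search call sits between the two `perf_counter` readings, so file reading and cleaning are not included in the measured time.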

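In short, this code finds the low-complexity regions in a given DNA sequence file, writes out their start positions, and measures and prints the processing time for each sequence.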

* Explanations are given as comments in the code.

3-1

import sys
import re
import time

# Remove all whitespace (spaces, tabs, newlines) from the string and convert it to uppercase
def removeBlank(oneLine):
    return re.sub(r"\s+", "", oneLine).upper()

# Check that the string consists only of A, T, G, and C; exit otherwise
def checkDnaData(oneLine):
    if re.fullmatch(r"[ATGC]+", oneLine):
        return
    else:
        print("No DNA sequence.\n" + oneLine)
        sys.exit()

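# Find low-complexity start positions with a regular expression
def find_low_complexity(DNA_data):
    LC_positions = []

    # (?P<cap>...) : creates a named capture group called cap
    # [ACGT]{2,5} : a substring of A/C/G/T with length 2 to 5
    # (?P=cap){2,} : the text captured by cap, repeated 2 or more times in a row
    # => 2 or more, because the capture group itself already matched the first copy,
    #    so the unit appears at least 3 times in total
    pattern = re.compile(r"((?P<cap>[ACGT]{2,5})(?P=cap){2,})")

    for match in pattern.finditer(DNA_data):
        # Append the start position of the low-complexity region to LC_positions
        LC_positions.append(match.start())

    return LC_positions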


# Take the file path as an argument ==================================
file_path = sys.argv[1]

# Check the file
try:
    with open(file_path, "r") as file:
        input_data = file.read()
        if not input_data:
            print("no DNA sequence")
            sys.exit()
        elif input_data.startswith('>'):
            # Split the text on header lines, i.e. lines that start with '>' and end with a newline.
            split_data = re.split(r">.*?\n", input_data)

            if not split_data[0]:
                split_data.pop(0)

            for i in range(len(split_data)):
                split_data[i] = removeBlank(split_data[i])
                checkDnaData(split_data[i])
        else:
            print("No correct format\nThe file must start with '>'.")
            sys.exit()

    # Call the function once before timing so that one-time setup cost
    # is not counted against the first sequence.
    find_low_complexity("AATGC")

    combined_string = []
    total_time = 0
    for j in range(len(split_data)):

        tm_join = split_data[j]  # already a single cleaned string
        # Start timing one DNA sequence
        start_time = time.perf_counter()

        # Find low-complexity start positions with the regular expression
        low_complex_positions = find_low_complexity(tm_join)

        # Stop timing this DNA sequence and store the elapsed time in microseconds
        required_time = (time.perf_counter() - start_time) * 1e6
        total_time += required_time

        if not low_complex_positions:
            print(f"Time required for sequence{j+1} : {required_time:.5f} (us) , ", end = '')
            low_complex_positions = 'No low-complexity region found'
            print(low_complex_positions)
            combined_string.append(low_complex_positions)
        else:
            print(f"Time required for sequence{j+1} : {required_time:.5f} (us)    ")
            combined_string.append("\n".join([str(i) for i in low_complex_positions]))

    forFile_string = "\n\n".join(f"seq{k+1}:\n{string}" for k, string in enumerate(combined_string))

    # Save the results to a file
    with open('output3-1.txt', 'w') as f:
        f.write(forFile_string)
        print("Finished successfully, results saved to output3-1.txt\n")

    print(f"All time spent: {total_time:.5f} (us)\n")

except IOError:
    print("Could not read the file.")
    sys.exit()
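To see what the 3-1 pattern actually captures, the following sketch prints the start, the captured unit, and the full matched text for a few made-up strings. `describe_regions` is a hypothetical helper for this illustration; the pattern is copied from find_low_complexity above.

```python
import re

# Same pattern as in find_low_complexity above
PATTERN = re.compile(r"((?P<cap>[ACGT]{2,5})(?P=cap){2,})")


def describe_regions(sequence):
    """Return (start, captured unit, full matched text) for each match."""
    return [(m.start(), m.group("cap"), m.group(0))
            for m in PATTERN.finditer(sequence)]


print(describe_regions("GGATATATCC"))    # [(2, 'AT', 'ATATAT')]
print(describe_regions("ATATATATATAT"))  # [(0, 'ATAT', 'ATATATATATAT')]
print(describe_regions("ACGTACGT"))      # [] because ACGT appears only twice
```

Because `{2,5}` is greedy, the engine tries longer units first, so the captured unit in the second example is "ATAT" rather than "AT". The reported start position is the same either way. Matches returned by `finditer` do not overlap, so the search resumes after the end of each match.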

3-2

import sys
import re
import time

# Remove all whitespace (spaces, tabs, newlines) from the string and convert it to uppercase
def removeBlank(oneLine):
    return re.sub(r"\s+", "", oneLine).upper()

# Check that the string consists only of A, T, G, and C; exit otherwise
def checkDnaData(oneLine):
    if re.fullmatch(r"[ATGC]+", oneLine):
        return
    else:
        print("No DNA sequence.\n" + oneLine)
        sys.exit()

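# Find low-complexity start positions without a regular expression
def find_low_complexity(sequence):
    start_positions = []  # list that stores the start positions of low-complexity regions
    index = 0  # index used to walk through the sequence

    while index < len(sequence) - 1:
        # Try each unit size from 2 to 5
        for size in range(2, 6):
            segment = sequence[index:index+size]  # extract a unit of length size starting at the current index
            repeat_count = 1  # reset the repeat count

            # Compare the following units, stepping by size from just after the current unit
            for next_index in range(index + size, len(sequence), size):
                if sequence[next_index:next_index+size] == segment:  # the same unit appears again
                    repeat_count += 1  # increase the repeat count
                else:
                    break  # stop at the first unit that differs

            # With 3 or more repeats, treat this as a low-complexity region and record its start
            if repeat_count >= 3:
                start_positions.append(index)
                # Jump to the start of the last repeated unit (not past it); scanning resumes
                # there, so a new region that begins in that last unit is also reported
                index += size * repeat_count - size
                break
        else:
            index += 1  # no unit size matched at this index (for-else), so move forward by 1

    return start_positions  # return the list of low-complexity start positions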


# Take the file path as an argument ==================================
file_path = sys.argv[1]

# Check the file
try:
    with open(file_path, "r") as file:
        input_data = file.read()
        if not input_data:
            print("no DNA sequence")
            sys.exit()
        elif input_data.startswith('>'):
            # Split the text on header lines, i.e. lines that start with '>' and end with a newline.
            split_data = re.split(r">.*?\n", input_data)

            if not split_data[0]:
                split_data.pop(0)

            for i in range(len(split_data)):
                split_data[i] = removeBlank(split_data[i])
                checkDnaData(split_data[i])
        else:
            print("No correct format\nThe file must start with '>'.")
            sys.exit()

    # Call the function once before timing so that one-time setup cost
    # is not counted against the first sequence.
    find_low_complexity("AATGC")

    combined_string = []
    total_time = 0
    for j in range(len(split_data)):

        tm_join = split_data[j]  # already a single cleaned string
        # Start timing one DNA sequence
        start_time = time.perf_counter()

        # Find low-complexity start positions without a regular expression
        low_complex_positions = find_low_complexity(tm_join)

        # Stop timing this DNA sequence and store the elapsed time in microseconds
        required_time = (time.perf_counter() - start_time) * 1e6
        total_time += required_time

        if not low_complex_positions:
            print(f"Time required for sequence {j+1} : {required_time:.5f} (us) , ", end = '')
            low_complex_positions = 'No low-complexity region found'
            print(low_complex_positions)
            combined_string.append(low_complex_positions)
        else:
            print(f"Time required for sequence {j+1} : {required_time:.5f} (us)    ")
            combined_string.append("\n".join([str(i) for i in low_complex_positions]))

    forFile_string = "\n\n".join(f"seq{k+1}:\n{string}" for k, string in enumerate(combined_string))

    # Save the results to a file
    with open('output3-2.txt', 'w') as f:
        f.write(forFile_string)
        print("Finished successfully, results saved to output3-2.txt\n")

    print(f"All time spent: {total_time:.5f} (us)\n")

except IOError:
    print("Could not read the file.")
    sys.exit()
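The two search functions usually report the same start positions, but they do not resume scanning from the same place after a hit, so they can disagree. The sketch below is a side-by-side check under assumptions: `starts_regex` and `starts_loop` are hypothetical names, the loop is a condensed copy of the 3-2 logic with the same index updates, and the test strings are made up.

```python
import re


def starts_regex(sequence):
    """Start positions found with the 3-1 regular expression."""
    pattern = re.compile(r"((?P<cap>[ACGT]{2,5})(?P=cap){2,})")
    return [m.start() for m in pattern.finditer(sequence)]


def starts_loop(sequence):
    """Start positions found with the 3-2 loop (condensed, same index updates)."""
    starts = []
    index = 0
    while index < len(sequence) - 1:
        for size in range(2, 6):
            unit = sequence[index:index + size]
            repeats = 1
            for nxt in range(index + size, len(sequence), size):
                if sequence[nxt:nxt + size] != unit:
                    break
                repeats += 1
            if repeats >= 3:
                starts.append(index)
                # Resume at the start of the last repeated unit, as in 3-2
                index += size * repeats - size
                break
        else:
            index += 1  # no unit size matched at this index
    return starts


for seq in ("GGATATATCC", "ACACACGACGACG"):
    print(seq, starts_regex(seq), starts_loop(seq))
```

For "GGATATATCC" both return [2]. For "ACACACGACGACG" the regular expression returns [0], because it consumes "ACACAC" and resumes after it, while the loop returns [0, 4], because it resumes at the last "AC" and then finds "ACG" repeated three times from index 4. Which behavior is wanted depends on whether overlapping regions should be reported.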

Concho