
Efficient Java Algorithm for Finding Duplicate Elements in an Array

ZIP archive

Based on the given file information, we can extract the following knowledge points:

1. **Application of the Java programming language**
   - Java is a widely used object-oriented programming language with cross-platform support ("write once, run anywhere"). It was released by Sun Microsystems in 1995 and is currently maintained by Oracle.
   - Java provides a rich set of APIs that let developers easily implement data structures, network programming, multithreading, graphical user interfaces, and more.
   - The file "Find_Duplicates" is most likely a Java program written to solve one specific problem: finding the duplicate elements in an array.

2. **Algorithms for finding duplicate elements**
   - Finding duplicate elements in an array or collection is a common programming problem with many possible solutions.
   - A time complexity of O(N) and a space complexity of O(N) mean that both the running time and the extra memory grow in proportion to the length of the input array. Linear complexity is generally considered efficient here, because the cost grows only proportionally with the input size rather than quadratically, as it would with the naive approach of comparing every pair of elements.

3. **Time complexity and space complexity**
   - Time complexity describes how an algorithm's running time grows as the input size grows, and is usually expressed in big-O notation.
   - Space complexity describes how much temporary storage the algorithm needs relative to the size of the input, and is likewise expressed in big-O notation.

4. **The array data structure**
   - An array is a basic data structure that stores a sequence of elements of the same type. In Java, an array's size is fixed once it has been created.
   - In this case the array is of type int, i.e. the stored elements are integers.

5. **Implementing the duplicate-finding algorithm**
   - Implementations typically rely on an auxiliary data structure, such as a hash table (a HashMap in Java), to record how many times each element occurs.
   - A common strategy is to traverse the array and, for each element, check whether it is already recorded in the hash table. If it is, the element is a duplicate; if not, add it to the table. (A short sketch of this approach follows this list.)

6. **HashMap in Java**
   - In Java, HashMap is a hash-table-based implementation of the Map interface that stores key-value pairs with unique keys.
   - HashMap offers O(1) average-case lookup performance, which makes it a very efficient data structure for detecting duplicates.

7. **The file name "Find_Duplicates-master"**
   - The name suggests that this is the main file of the archive containing the "Find_Duplicates" Java program.
   - The "master" suffix likely refers to the main branch of the program's version-control repository, for example the master branch in Git.

Putting these points together, the file "Find_Duplicates" is most likely a Java program that finds the duplicate elements in an int array, performs well in both time and space complexity, and uses an efficient algorithm, for example a HashMap that records and checks how often each element occurs. Such a program is particularly useful on large data sets, because its efficiency keeps the time and memory cost within reasonable bounds.
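The actual source of the archive is not shown on this page, so as a minimal sketch of the HashMap-based strategy described in points 2, 5, and 6 above, the code below counts occurrences in one pass and then collects the values seen more than once. The class name `FindDuplicatesSketch` and the method `findDuplicates` are placeholders, not names taken from the Find_Duplicates program.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FindDuplicatesSketch {

    // Returns the values that appear more than once in the input array.
    // O(N) time and O(N) extra space: one pass to count occurrences,
    // one pass over the map to collect the duplicates.
    static List<Integer> findDuplicates(int[] numbers) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int value : numbers) {
            counts.merge(value, 1, Integer::sum); // increment the count for this value
        }

        List<Integer> duplicates = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        int[] sample = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
        // Prints the duplicated values, e.g. [1, 3, 5]; HashMap iteration order is not guaranteed.
        System.out.println(findDuplicates(sample));
    }
}
```

If only membership matters (has this value been seen before?), a HashSet would be enough; the HashMap variant above additionally keeps the occurrence counts, at the same O(N) time and space cost.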
