Crawling Across the Internet

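After tinkering for the past two days, I've gradually formed a picture of what data collection involves. On the surface, the data a person can find by eye is very limited: this is the surface web, the part that search engines can crawl, and it reportedly accounts for only about ten percent of the internet's resources. How much data there really is, nobody can say. Fortunately, although search engines cannot reach some HTTP-based pages, a crawler can. Building on the earlier tinkering, let's try something new: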

Before collecting any data you at least need a clear target. Once again it's our old friend Baidu:

try:
    from urllib.request import urlopen  # Python 3
    from urllib.parse import urljoin
except ImportError:
    from urllib import urlopen  # Python 2.7
    from urlparse import urljoin
from bs4 import BeautifulSoup
import re
import random

pages = set()
random.seed()  # seed from the operating system or the current time


def getInternalLinks(bsObj, includeUrl):
    # Internal links start with "/" or contain the current domain.
    internalLinks = []
    pattern = re.compile("^(/|.*" + re.escape(includeUrl) + ")")
    for link in bsObj.findAll("a", href=pattern):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks


def getExternalLinks(bsObj, excludeUrl):
    # External links start with "http" or "www" and never mention the current domain.
    externalLinks = []
    pattern = re.compile("^(http|www)((?!" + re.escape(excludeUrl) + ").)*$")
    for link in bsObj.findAll("a", href=pattern):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks


def splitAddress(address):
    # Strip the scheme (http or https) so that element 0 is the domain.
    addressParts = address.replace("http://", "").replace("https://", "").split("/")
    return addressParts


def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    domain = splitAddress(startingPage)[0]
    externalLinks = getExternalLinks(bsObj, domain)
    if len(externalLinks) == 0:
        # No external links here: hop to a random internal page and try again.
        internalLinks = getInternalLinks(bsObj, domain)
        nextPage = urljoin(startingPage, random.choice(internalLinks))
        return getRandomExternalLink(nextPage)
    else:
        return random.choice(externalLinks)


def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random Page is:" + externalLink)
    # Recurses until a request fails or Python's recursion limit is reached.
    followExternalOnly(externalLink)


followExternalOnly("https://baidu.com")

The program above starts at https://baidu.com and hops at random from one external link to the next.
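
To see how the two regular expressions sort links into internal and external, here is a small offline sketch. It assumes Python 3 and applies the same patterns to a hand-written list of href values; the helper name classify_links and the sample links are made up for illustration and are not part of the crawler above.

```python
import re


def classify_links(hrefs, domain):
    """Split hrefs into (internal, external) using the crawler's two patterns."""
    internal_re = re.compile("^(/|.*" + re.escape(domain) + ")")
    external_re = re.compile("^(http|www)((?!" + re.escape(domain) + ").)*$")
    internal, external = [], []
    for href in hrefs:
        if internal_re.search(href):
            internal.append(href)
        elif external_re.search(href):
            external.append(href)
        # Anything else (anchors, javascript: links) is ignored.
    return internal, external


# Made-up sample links, not scraped from a real page.
sample_hrefs = [
    "/more/",
    "https://www.baidu.com/duty",
    "http://example.com/page",
    "www.example.org",
    "#top",
    "javascript:;",
]
internal, external = classify_links(sample_hrefs, "baidu.com")
print("internal:", internal)
print("external:", external)
```

Links that match neither pattern are simply dropped, which is why the crawler never follows in-page anchors or javascript: links.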

The output shows some of the pages around Baidu, such as Baidu Wenku and Baidu Xueshu.

We can also add something new: collect a list of every external link found on the site:

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

allExtLinks = set()
allIntLinks = set()


def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    internalLinks = getInternalLinks(bsObj, splitAddress(siteUrl)[0])
    externalLinks = getExternalLinks(bsObj, splitAddress(siteUrl)[0])
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        # Resolve relative links such as "/more/" against the current page.
        link = urljoin(siteUrl, link)
        if link not in allIntLinks:
            print("will get:" + link)
            allIntLinks.add(link)
            getAllExternalLinks(link)


getAllExternalLinks("https://baidu.com")

The program flow is then as follows:
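
The same flow can be traced without any network access. The sketch below assumes Python 3 and a made-up in-memory "site" (FAKE_SITE, with example.com style URLs that are purely hypothetical). It also swaps the recursion for a queue, which is a design change and not what the code above does: visit a page, record its external links, and queue every internal link that has not been seen yet.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical site: each URL maps to the hrefs found on that page.
FAKE_SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example.org/"],
    "https://example.com/a": ["/", "/b", "http://third.example.net/x"],
    "https://example.com/b": ["/a", "https://other.example.org/"],
}


def crawl_external_links(start_url, fetch_links):
    """Walk all internal pages breadth-first and collect external links."""
    domain = urlparse(start_url).netloc
    seen_internal = {start_url}
    external = set()
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for href in fetch_links(page):
            absolute = urljoin(page, href)  # resolve relative links
            if urlparse(absolute).netloc == domain:
                if absolute not in seen_internal:
                    seen_internal.add(absolute)
                    queue.append(absolute)
            else:
                external.add(absolute)
    return external, seen_internal


external, internal = crawl_external_links(
    "https://example.com/", lambda url: FAKE_SITE.get(url, []))
print(sorted(external))
print(sorted(internal))
```

Using a queue instead of recursion avoids hitting Python's recursion limit on sites with many internal pages; fetch_links is passed in so that a real downloader could replace the dictionary lookup.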

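One problem came up here: how should 301 redirects be handled? (301 is one of the status codes in the header of the HTTP response a server returns when a user or a search engine requests a page; it means the page has moved permanently to another address.)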

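A redirect allows the same page to be displayed under a different domain name. Redirects come in two forms:
• Server-side redirects, where the URL changes before the page is loaded;
• Client-side redirects, where you sometimes see a message such as "This page will redirect in 10 seconds...", meaning the page has to load content before jumping to the new URL.
Server-side redirects are usually nothing to worry about: the urllib library in Python 3.x handles them automatically. Just keep in mind that the URL of the page you end up scraping may not be the URL you originally requested.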
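
A minimal sketch of that behavior, assuming Python 3: a throwaway local server answers /old with a 301 pointing to /new, and urlopen follows it silently. The handler class, the paths, and the helper fetch_following_redirects are all invented for this demo; geturl() and getcode() are the standard response methods for reading the final URL and status.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class RedirectHandler(BaseHTTPRequestHandler):
    """Tiny local server: /old answers 301 and points to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            body = b"<h1>new page</h1>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_following_redirects(url):
    """Return (final_url, status) after urllib has followed any redirects."""
    with urlopen(url) as response:
        return response.geturl(), response.getcode()


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

final_url, status = fetch_following_redirects(base + "/old")
print("requested:", base + "/old")
print("landed on:", final_url, status)

server.shutdown()
server.server_close()
```

Comparing the requested URL with geturl() is a simple way to notice that a page has moved, for example before adding it to a set of visited links.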

Trying Scrapy

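In practice we keep repeating the same task: find all the links on a page, jump to a new page, and then do the same thing again. Scrapy can greatly reduce the work of finding and recognizing links. For my Python 2.7 setup this library is a real godsend, lol.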

Install Scrapy: pip install Scrapy (if it fails with a timeout, run pip --default-timeout=100 install -U pip and try again; if that still doesn't work, download it manually from https://pypi.python.org/simple/scrapy/)

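Once it is installed, we first need to create a working environment:

In cmd, run: scrapy startproject src, where src is the name of the project you want to create. The project is created in whatever directory you run the command from.

Then import the project into PyCharm.

Next, add a new .py file under spiders and modify items.py: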

from scrapy import Spider
from src.items import Article  # the project was created as "src"


class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["baike.baidu.com"]
    start_urls = ["https://baike.baidu.com/item/Avicii",
                  "https://baike.baidu.com/item/Python_%28programming_language%29"]

    def parse(self, response):
        # Called once per downloaded page; stores the first <h1> text as the title.
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field
class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
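
To get a feel for what response.xpath('//h1/text()')[0] pulls out in parse, here is a rough offline stand-in, assuming Python 3 and only the standard library. It is not Scrapy's selector and not a real XPath engine: it just remembers the first non-empty piece of text inside an <h1>. The class FirstH1Text, the helper first_h1_text, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class FirstH1Text(HTMLParser):
    """Remember the first non-empty text chunk found inside an <h1>."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Keep only the first non-blank text seen while inside an <h1>.
        if self.in_h1 and self.title is None and data.strip():
            self.title = data.strip()


def first_h1_text(html):
    """Return the first <h1> text in html, or None if there is no <h1>."""
    parser = FirstH1Text()
    parser.feed(html)
    return parser.title


# Hand-written sample page, not a real Baidu Baike response.
sample_html = "<html><body><h1>Avicii</h1><p>body text</p><h1>Other</h1></body></html>"
print("Title is: " + first_h1_text(sample_html))
```

Unlike this sketch, indexing [0] on an empty XPath result raises an IndexError, so a page without an <h1> would make the spider's parse method fail.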

Then add begin.py as the launcher:

from scrapy import cmdline

cmdline.execute("scrapy crawl article".split())

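The final directory layout is:

Next, set up the configuration:

It can now be run. (If Python's imports fail at this point, reinstall the libraries yourself; I say this because I hit exactly that kind of exception.)

I'm pasting it here for now and will update this post once I've fixed the problem: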

Traceback (most recent call last):
  File "D:\Python\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "D:\Python\lib\site-packages\scrapy\crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "D:\Python\lib\site-packages\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "D:\Python\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "D:\Python\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "D:\Python\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "D:\Python\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "D:\Python\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "D:\Python\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "D:\Python\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
  File "D:\Python\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "D:\Python\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "D:\Python\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "D:\Python\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "D:\Python\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ImportError: No module named win32api
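
The last line of the traceback says that win32api cannot be imported. That module normally ships with the pywin32 package, so installing pywin32 with pip is the usual remedy, though that has not been verified against this exact setup. A small sketch, assuming Python 3, for checking up front which modules are missing (the helper name missing_modules is made up and is not part of Scrapy or Twisted):

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# win32api only exists on Windows with pywin32 installed; json is always there.
print(missing_modules(["json", "win32api"]))
```

Running a check like this before scrapy crawl makes a missing dependency show up as a short list instead of a long Twisted traceback.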
