This article shows how to collect Chinese idioms from the Internet with Python.

For the experiment I chose http://cy.5156edu.com , a website that contains many Chinese idioms.

First, make sure the BeautifulSoup library is installed for Python; it helps us parse the web page. We will also use a regular expression to pick out the elements we want.
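As a small illustration of the regular-expression part, here is a sketch that keeps only the links whose href starts with /html and turns them into absolute URLs. It uses only the standard library. The helper name internal_links and the sample hrefs /about.html and http://example.com/ are my own assumptions for the demo; the two /html4/ paths are taken from the site.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://cy.5156edu.com'
# Idiom pages on this site live under paths that start with "/html"
INTERNAL_LINK = re.compile(r'^/html')

def internal_links(hrefs):
    """Return absolute URLs for the hrefs that look like idiom pages."""
    return [urljoin(BASE_URL, href) for href in hrefs if INTERNAL_LINK.match(href)]

# Hypothetical hrefs as they might appear in <a> tags on a page
sample_hrefs = ['/html4/562.html', 'http://example.com/', '/about.html', '/html4/3686.html']
for url in internal_links(sample_hrefs):
    print(url)
```

Using urljoin instead of plain string concatenation is a design choice, not something the site requires; it also copes with hrefs that are already absolute.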

The general idea is to start from one page, extract the idiom and its explanation, collect the internal URLs on that page that have not been visited yet, and then process each of those new URLs in the same way.
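The idea can be tried without any network access. The sketch below replaces the real pages with a tiny hypothetical link graph (the dictionary LINKS and the page names A to D are made up) and walks it with a queue, which is a breadth-first variant rather than the recursive approach used in the script. The point it shows is the visited set: every page, including the start page, is handled at most once.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for real HTTP requests)
LINKS = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['D'],
    'D': [],
}

def crawl(start, max_pages=10):
    """Visit pages breadth-first, each at most once, up to max_pages."""
    visited = {start}          # mark the start page so it is never queued again
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)     # a real crawler would fetch and parse the page here
        for link in LINKS.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

print(crawl('A'))  # ['A', 'B', 'C', 'D']
```

A queue also avoids deep recursion when the crawl is allowed to run over many pages.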

#! /usr/local/bin/python3
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import re

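# URLs that have already been queued for a visit
pages = set()

def GetLinks( pageUrl ):
    global pages
    # Stop fetching once more than 10 URLs have been collected
    if len( pages ) > 10:
        return
    html = urlopen( pageUrl )
    bs = BeautifulSoup( html, 'html.parser' )
    # The idiom itself is in the table cell that spans six columns
    name = bs.find( 'td', { 'colspan': '6'} )
    print( pageUrl )
    if name is not None:
        print( name.get_text() )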


    # CSS selector of the explanation cell (from the browser's "Inspect Element"):
    #
    #   #table3 > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(1) >
    #   table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(5) > td:nth-child(2)
    #
    # The corresponding HTML:
    #
    #   <tr bgcolor="#ffffff">
    #   <td width="15%">解释:</td>
    #   <td colspan="5">虚心听取谏议</td>
    #   </tr>


    # Walk the rows of the inner table and print the one labelled "解释:" (explanation)
    rows = bs.find(id='table3').find( 'table' ).find_all( 'tr' )
    for row in rows:
        cells = row.find_all( 'td' )
        if len( cells ) > 1 and cells[0].get_text() == "解释:":
            print( cells[0].get_text(), end=" " )
            print( cells[1].get_text() )


    # Follow every internal link (href starting with "/html") that has not been seen before
    links = bs.find_all( 'a', href = re.compile( '^/html' ) )
    for link in links:
        newLink = 'http://cy.5156edu.com' + link.attrs['href']
        if newLink not in pages:
            pages.add( newLink )
            GetLinks( newLink )

GetLinks( 'http://cy.5156edu.com/html4/1941.html' )

Output (the start page is printed twice because its URL is not yet in pages when the crawl begins):

http://cy.5156edu.com/html4/1941.html
纳谏如流
解释: 虚心听取谏议
http://cy.5156edu.com/html4/562.html
从谏如流
解释: 听从直言规劝,像水从高处流下一样顺畅。形容乐意接受别人的批评意见
http://cy.5156edu.com/html4/1941.html
纳谏如流
解释: 虚心听取谏议
http://cy.5156edu.com/html4/3686.html
从令如流
解释: 从令:服从命令;如流:好象流水向下,形容迅速。形容绝对服从命令。
http://cy.5156edu.com/html4/3691.html
从善如流
解释: 从:听从;善:好的,正确的;如流:好象流水向下,形容迅速。形容能迅速地接受别人的好意见。
http://cy.5156edu.com/html4/565.html
从善如登
解释: 指为善如登山那样不易,比喻学好很难
http://cy.5156edu.com/html4/878.html
改恶从善
解释: 改去坏的、错误的,向好的、正确的方向转化
http://cy.5156edu.com/html4/3690.html
从善如登,从恶如崩
解释: 比喻学好很难,学坏极容易。
http://cy.5156edu.com/html4/15649.html
从善若流
解释: 见“从善如流”。
http://cy.5156edu.com/html4/17030.html
改行从善
解释: 见“改行为善”。
http://cy.5156edu.com/html4/17031.html
改行为善
解释: 改变不良行为,诚心向善。
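
The script imports HTTPError and URLError but never uses them, so a single bad link would stop the whole crawl. Below is a minimal sketch of how the download step could be wrapped; the function name fetch_html and the 10-second timeout are my own assumptions, not part of the original script.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404
        print('HTTP error', err.code, 'for', url)
    except URLError as err:
        # The server could not be reached, or the resource does not exist
        print('Could not reach', url, '-', err.reason)
    return None
```

The returned bytes can be passed to BeautifulSoup( html, 'html.parser' ) just like the result of urlopen, and the caller can skip the page when the result is None. HTTPError is caught first because it is a subclass of URLError.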
Categories: Python
